Introduction

This set of notes comes originally from a workshop that introduces participants to latent variable modeling techniques and graphical models more generally. However, it is now more or less a set of notes on graphical/latent variable modeling more generally, much of which may not be covered in the workshop in detail.

When first encountering these models, it may depend on one’s discipline how such models may be presented. The following will take a broad view initially, but then focus on what is usually referred to as structural equation modeling in the social and educational science literature. An attempt will be made to use a consistent, rather than discipline specific nomenclature and approach.

One of the goals of this set of notes is to not instill a false sense of comfort and/or familiarity with the techniques. No one is going to be an expert after a couple of afternoons with SEM. SEM and other techniques covered are typically taught over the course of a few weeks in a traditional statistics course, or given their own course outright. Instead, one of the primary goals here is to instill a firm conceptual foundation starting with common approaches (e.g. standard regression), while exposing the participant to a wide family of related techniques, any of which might be useful to one’s modeling and data situation.

Prerequisites for the workshop

Statistical

One should at least have a firm understanding of standard regression modeling techniques. If you are new to statistical analysis in general, I’ll be blunt and say you are probably not ready for SEM. SEM employs knowledge of maximum likelihood, multivariate analysis, measurement error, indirect effects etc., and none of this is typically covered in a first semester of statistics in many applied disciplines. SEM builds upon that basic regression foundation, and if that is not solid, SEM will probably be confusing at best.

Programming

SEM requires its own modeling language approach. As such, the syntax for Mplus and SEM specific programs, as well as SEM models within other languages or programs (e.g. R or Stata) are going to require you to learn something new.

Outline

Graphical Models

The workshop will start with the familiar, a standard regression model. It will then be presented as a graphical model, and extended to include indirect effects (e.g. \(\mathcal{A} \rightarrow \mathcal{B} \rightarrow \mathcal{C}\)). At this point we will discuss directed graphs, and demonstrate a more theoretically motivated approach (sometimes called path analysis), and compare it to a more flexible approach (Bayesian network) that does not require prespecification of paths nor specific directional relations. At this point, we’ll briefly discuss undirected graphs, with an example utilizing ‘network analysis’, though any exercise will likely be left to the participant’s own time.

Latent Variables

We will then discuss the notion of latent variables in the context of underlying causes or constructs and an understanding of the notion of measurement error. We will also note that latent variable models are actually even more broadly utilized when one includes other dimension reduction, or data compression, techniques, several of which fall under the heading of factor analysis. A few common techniques will be demonstrated such as ‘factor analysis’ and principal components analysis, and an overview will be provided for others. In addition, we will briefly discuss the relation of latent variable models to random effects models (something that will be demonstrated in more detail later), and note other places one might find latent variables (e.g. the EM algorithm, hidden Markov models).

SEM

The bulk of the rest of the workshop will focus on structural equation modeling. We will spend a good deal of time with measurement models first, comparing them to our previous efforts, and then extend those models to the case of regression with latent variables. There are many issues to consider when developing such models, and an attempt will be made to cover quite a bit of ground in that regard.

Others

Time may permit discussion of other topics within the SEM realm. For example latent growth curve models, are an alternative to a standard mixed model. Also, there are situations where the latent variable might be considered categorical, commonly called mixture models or cluster analysis, but in some specific contexts might go by other names (e.g. latent class analysis). Finally, an overview of other types of latent variable or structural equation models, such as item response theory, collaborative filtering etc. may also be provided.

Programming Language Choice

We will use R for this workshop for a variety of reasons. One is that all of the techniques mentioned thus far are fully developed within various R packages, often taking just a line or two of code to implement after the data has been prepped. Furthermore, it is freely available and will work on Windows, Mac and Linux. R is well-known as a powerful statistical modeling environment, and its flexible programming language allows for efficient data manipulation and exploration. Furthermore, it has a vast array of visualization capabilities. In short, it provides everything one might need in a single environment.

Among alternatives, Mplus is the most fully developed structural equation modeling package, and has been for years. However, it is a poor tool for data management, the University of Michigan does not have a campus-wide license for it, and most of its functionality (and all we will need for this workshop) is implemented within the lavaan family of R packages. Stata has recently provided SEM capabilities, but it is less well-developed (something that might change in time), and it still requires a license, making non-campus usage difficult or costly. SPSS Amos would perhaps be another alternative for some, but suffers the same licensing issues and is not as flexible as Mplus, historically it has lagged far behind Mplus in capabilities, and furthermore it is not supported for Unix operating systems such as Mac and Linux. Other alternatives include Proc Calis in SAS and OpenMX (another R package). Python implementation seems minimal presently. Historically, people used EQS and LISREL, but the former is no longer developed and the latter is no longer relatively widely used, as well as being a Windows-only application.

Setup for the workshop

For the following to go smoothly, you’ll need to complete the following steps precisely.

If you are not present or are bringing your laptop, you’ll need to have both R and Rstudio installed on whatever machine you’ll be using. If this will be a new experience for you, install R first, then Rstudio. For either you’ll need to choose the version appropriate to your operating system. As you go through the installation, for both just accept all defaults when prompted until the installation process begins. Once both are installed, you will only need to work with Rstudio, and it will at all times be assumed you will be using Rstudio during the workshop.

Once those are installed, proceed through the following steps.

  1. Download this zipfile, and unzip its contents to an area on your machine that you have write access to. It contains the course contents, data, etc. in a folder that will serve as an Rstudio project folder.

  2. Open Rstudio. File/Open Project/ then navigate to the folder contents you just unzipped. Click on the SEM file (should look like a blue icon, but otherwise is the SEM.Rproj file).

  3. If the file is not already opened after opening the Rstudio project, File/Open File/my_code.R . Run the one line of code there and you’re set.

The lab for this workshop has Windows machines, and so the above is enough to proceed. For *nix systems, it’s probably easiest to just install the packages as we use them.

Color coding in text

  • object or class of object
  • function
  • package
  • important term

Introduction to R

This introduction to R is very brief and only geared toward giving you some basics so that you can understand and run the code for this course.

Getting Started

Installation

As mentioned previously, to begin with R for your own machine, you just need to go R website, download it for your operating system, and install. Then go to Rstudio, download and install it. From there on you only need Rstudio to use R.

Updates occur every few months, and you should update R whenever a new version is released.

Packages

As soon as you install it, R is already the most powerful statistical environment within which to work. However, its real strength comes from the community, which has added thousands of packages that provide additional or enhanced functionality. You will regularly find packages that specifically do something you want, and will need to install them in order to use.

Rstudio provides a Packages tab, but it is usually just as or more efficient to use the install.packages function.

install.packages('mynewfavorite')

At this point there are over 7000 packages available through standard sources, and many more through unofficial ones. To start getting some ideas of what you want to use, you will want to spend time at places like CRAN Task Views, Rdocumentation, or with a list like this one.

The main thing to note is that if you want to use a package, you have to load it with the library function.

library(lazerhawk)

Sometimes, you only need the package for one thing and there is no reason to keep it loaded, in which case you can use the following approach.

packagename::packagefunction(args)

I suggest using the library function for this workshop.

You’ll get the hang of finding, installing, and using packages quite quickly. However, note that the increasing popularity and ease of using R means that packages can vary quite a bit in terms of quality, so you may need to try out a couple packages with seemingly similar functionality to find the best for your situation.

Rstudio

Rstudio is an integrated development environment (IDE) specifically geared toward R (though it works for other languages too). At the very least it will make your programming far easier and more efficient, at best you can create publish-ready documents, manage projects, create interactive website regarding your research, use version control, and much more. I have an overview here.

See Emacs Speaks Statistics for an alternative. The point is, base R is not an efficient way to use R, and you have at least two very powerful options to make your coding easier.

Key things to know for this course

R is a programming language, not a ‘stats package’

The first thing to note for those new to R is that R is a language that is oriented toward, but not specific to, dealing with statistics. This means it’s highly flexible, but does take some getting used to. If you go in with a spreadsheet style mindset, you’ll have difficulty, as well as miss out on the power it has to offer.

Never ask if R can do what you want. It can.

The only limitation to R is the user’s programming sophistication. The better you get at statistical programming the further you can explore your data and take your research. This holds for any statistical endeavor whether using R or not.

Main components: script, console, graphics device

With R, the script is where you write your R code. While you could do everything at the console, this would be difficult at best and unreproducible. The console is where the results are produced from running the script. Again you can do one-liner stuff there, such as getting help for a function. The graphics device is where visualizations are produced, and in Rstudio you have two, one for static plots, and a viewer for potentially interactive ones.

R is easy to use, but difficult to master.

If you only want to use R in an applied fashion, as we will do here, R can be very easy to use. As an example the following code would read in data and run a regression.

mydata = read.csv(file='location/myfile.csv')
regModel = lm(y ~ x, data=mydata)
summary(regModel)

The above code demonstrates that R can be as easy to use as anything else, and in my opinion, it is for any standard analysis. The nice part is that it can still be easier to use with more complex analyses you either won’t find elsewhere or will only get with crippled functionality.

Object-oriented

R is object-oriented. For our purposes this means that we create what are called objects and use functions to manipulate those objects in some fashion. In the above code we created two objects, an object that held the data and an object that held the regression results.

Objects can be anything, a character string, a list of 1000 data sets, the results of an analysis, anything.

Assignment

In order to create an object, we must assign something to it. You’ll come across two ways to do this.

myObject = something
myObject <- something

These result in the same thing, an object called myObject that contains ‘something’ in it. For all appropriate practical use, which you use is a matter of personal preference. The point is that typical practice in R entails that one assigns something to an object and then uses functions on that object to get something more from it.

Functions

Functions are objects that take input and produce some output (called a value). In the above code we used three functions: read.csv, lm, and summary. The inputs are called arguments. With read.csv, we merely gave it one argument the file name, but there are actually a couple dozen, each with some default. Type ?read.csv at the console to see the helpfile.

Classes

All objects have a certain class, which means that some functions will work in certain ways depending on the class. Consider the following code.

x = rnorm(100)
y = x + rnorm(100)
mod = lm(y ~ x)
summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.85100 -0.71800  0.01060 -0.05248  0.56450  2.66300 
summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6829 -0.6876  0.0083  0.7160  2.6754 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1910     0.1109   1.722   0.0882 .  
x             0.8429     0.1065   7.916 3.83e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.108 on 98 degrees of freedom
Multiple R-squared:   0.39, Adjusted R-squared:  0.3838 
F-statistic: 62.67 on 1 and 98 DF,  p-value: 3.828e-12

In both cases we used the summary function, but get different results. There are two methods for summary at work here: summary.default and summary.lm. Which is used depends on the class of the object given as an argument to the summary function.

class(x)
[1] "numeric"
class(mod)
[1] "lm"

One of the most common classes of objects used is the data.frame. Data frames are matrices that contain multiple types of vectors that are the variables and observations of interest. The contents can be numeric, factors (categorical variables), character strings and other types of objects.

nams = c('Bob Ross', 'Bob Marley', 'Bob Odenkirk')
places = c('Dirt', 'Dirt', 'New Mexico')
yod = c(1995, 1981, 2028)

bobs = data.frame(nams, places, yod)
bobs
          nams     places  yod
1     Bob Ross       Dirt 1995
2   Bob Marley       Dirt 1981
3 Bob Odenkirk New Mexico 2028

Case sensitive

A great deal of the errors you will get when you start learning R will result from not typing something correctly. For example, what if we try summary(X)?

summary(X)
Error in summary(X): object 'X' not found

Errors are merely messages or calls for help from R. It can’t do what you ask and needs something else in order to perform the task. In this case, R is telling you it can’t find anything called X, which makes sense because the object name is lowercase \(x\). This error message is somewhat straightforward, but error messages for all programming languages typically aren’t, and depending on what you’re trying to do, it may be fairly vague. Your first thought as you start out with R should be to check the names of things if you get an error.

The lavaan package

You’ll see more later, but in order to use lavaan for structural equation modeling, you’ll need two things primarily: an object that represents the data, and an object that represents the model. It will look something like the following.

modelCode = "
  y ~ x1 + x2 + lv            # structural/regression model
  lv =~ z1 + z2 + z3 + z4     # measurement model with latent variable lv
"

library(lavaan)
mymodel = sem(modelCode, data=mydata)

The character string is our model, and it can actually be kept as a separate file (without the assignment) if desired. Using the lavaan library, we can then use one of its specific functions like cfa, sem, or growth to use the model code object (modelCode above) and the data object (mydata above).

The way to define things in lavaan can be summarized as follows.

formulaType operator mnemonic
latent variable definition =~ is measured by
regression ~ is regressed on
(residual) (co)variance ~~ is correlated with
intercept ~ 1 intercept
fixed parameter _* is fixed at the value of _

Getting help

Use ? to get the helpfile for a specific function, or ?? to do a search, possibly on a quoted phrase.

Examples:

?lm
??regression
??'nonlinear regression'

Moving forward

Hopefully you’ll get the hang of things as we run the code in later chapters. You’ll essentially only be asked to tweak code that’s already provided. This is not a ‘learning R’ course. The CSCAR group provides a couple of those if you’re interested. However, here are some exercises to make sure we start to get comfortable with it.

Exercises

Interactive version here

  1. Create an object that consists of the numbers one through five. using c(1,2,3,4,5) or 1:5

  2. Create a different object, that is the same as that object, but plus 1 (i.e. the numbers two through six).

  3. Without creating an object, use cbind and rbind, feeding your objects as arguments. For example cbind(obj1, obj2).

  4. Create a new object using data.frame just as you did rbind or cbind in #3.

  5. Inspect the class of the object, and use the summary function on it.

Summary

There’s a lot to learn with R, and the more time you spend with R the easier your research process will likely go, and the more you will learn about statistical analysis.

Graphical Modeling

A graphical model can be seen as a mathematical or statistical construct connecting nodes via edges. When pertaining to statistical models, the nodes might represent variables of interest in our data set, and edges specify the relationships among them. Visually they are depicted in the style of the following examples.

Any statistical model you’ve conducted can be expressed as a graphical model. As an example, the first graph with nodes X, Y, and Z might represent a regression model in which X and Z predict Y. The emoticon graph shows an indirect effect, and the 123 graph might represent a correlation matrix.

A key idea of a graphical model is that of conditional independence, something one should keep in mind when constructing their models. The concept can be demonstrated with the following graph.

In this graph, X is conditionally independent of Z given Y- there is no correlation between X and Z once Y is accounted for1. We will revisit this concept when discussing path analysis and latent variable models. Graphs can be directed, undirected, or mixed. Directed graphs have arrows, sometimes implying a causal flow (a difficult endeavor to demonstrate explicitly) or noting a time component. Undirected graphs merely denote relations among the nodes, while mixed graphs might contain both directional and symmetric relationships. Most of the models discussed in this course will be directed or mixed.

Directed Graphs

As noted previously, we can represent standard models as graphical models. In most of these cases we’d be dealing with directed or mixed graphs. Almost always we are specifically dealing with directed acyclic graphs, where there are no feedback loops.

Standard linear model

Let’s start with the standard linear model (SLiM), i.e. a basic regression we might estimate via ordinary least squares (but not necessarily). In this setting, we want to examine the effect of each potential predictor (x* in the graph) on the target variable (y). The following shows what the graphical model might look like.

We start with the SLiM as a way to keep us thinking about the more complex models we will see later in the same way we do other models we might be more familiar with. In what follows, we’ll show that whether we use a standard R modeling approach (via the lm function), or an SEM approach (via the sem function in lavaan), the results are identical (aside from the fact that sem is using maximum likelihood).

mcclelland = haven::read_dta('data/path_analysis_data.dta')
lmModel = lm(math21 ~ male + math7 + read7 + momed, data=mcclelland)

Here we can do the same model using the lavaan package, and while the input form will change a bit, and the output will be presented in a manner consistent with sem, the estimated parameters are identical. Note that the residual standard error in lm is the square root of the variance estimate in the lavaan output.

library(lavaan)
model = "
  math21 ~ male + math7 + read7 + momed 
"
semModel = sem(model, data=mcclelland, meanstructure = TRUE)

summary(lmModel)

Call:
lm(formula = math21 ~ male + math7 + read7 + momed, data = mcclelland)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9801 -1.2571  0.1376  1.4544  5.7471 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.84310    1.16418   4.160 4.08e-05 ***
male         1.20609    0.25831   4.669 4.44e-06 ***
math7        0.31306    0.04749   6.592 1.76e-10 ***
read7        0.08176    0.01638   4.991 9.81e-07 ***
momed       -0.01684    0.06651  -0.253      0.8    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.329 on 324 degrees of freedom
  (101 observations deleted due to missingness)
Multiple R-squared:  0.258, Adjusted R-squared:  0.2488 
F-statistic: 28.16 on 4 and 324 DF,  p-value: < 2.2e-16
summary(semModel, rsq=T)
lavaan (0.5-20) converged normally after  21 iterations

                                                  Used       Total
  Number of observations                           329         430

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  Z-value  P(>|z|)
  math21 ~                                            
    male              1.206    0.256    4.705    0.000
    math7             0.313    0.047    6.642    0.000
    read7             0.082    0.016    5.030    0.000
    momed            -0.017    0.066   -0.255    0.799

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    math21            4.843    1.155    4.192    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    math21            5.341    0.416   12.826    0.000

R-Square:
                   Estimate
    math21            0.258

As we will see in more detail later, SEM incorporates more complicated regression models, but at this point it has the exact same interpretation as our standard regression. As we go along, we can see the models as generalizations of those we are already well acquainted with, and so one can use that prior knowledge as a basis for understanding the newer content.

Path Analysis

Path Analysis, and thus SEM, while new to some, is in fact a very, very old technique, statistically speaking2. It can be seen as a generalization of the SLiM approach that can allow for indirect effects and multiple target variables. Path analysis also has a long history in the econometrics literature though under different names (e.g. instrumental variable regression, 2-stage least squares etc.), and through the computer science realm through the use of graphical models more generally. As such, there are many tools at your disposal for examining such models, and I’ll iterate that much of the SEM perspective on modeling comes largely from specific disciplines, while other approaches may be better for your situation.

Types of relationships

The types of potential relationships examined by path analysis can be seen below.

Aside: Tracing rule

In a recursive model, implied correlations between two variables, X1 and X2, can be found using tracing rules. Implied correlations between variables in a model are equal to the sum of the product of all standardized coefficients for the paths between them. Valid tracings are all routes between X1 and X2 that a) do not enter the same variable twice and b) do not enter a variable through an arrowhead and leave through an arrowhead. The following examples assume the variables have been standardized (variance values equal to 1), if standardization has not occurred the variance of variables passed through should be included in the product of tracings.

Consider the following variables, A, B, and C (in a dataframe called abc) with a model seen in the below diagram. We are interested in identifying the implied correlation between x and z by decomposing the relationship into its different components and using tracing rules.

cor(abc)
    A   B   C
A 1.0 0.2 0.3
B 0.2 1.0 0.7
C 0.3 0.7 1.0
model = "
  C ~ A + B
  B ~ A
"

pathMod = sem(model, data=abc)
coef(pathMod)
  C~A   C~B   B~A  C~~C  B~~B 
0.167 0.667 0.200 0.478 0.950 

To reproduce the correlation between A and C (sometimes referred to as a ‘total effect’):

  • Corr = ac + ab * ac
  • Corr = 0.167 + 0.133
  • Corr = 0.3

In SEM models, it’s important to consider how well our model-implied correlations correspond to the actual observed correlations. For over-identified models, the correlations will not be reproduced exactly, and thus can serve as a measure of how well our model fits the data. More on this later.

Multiple Targets

While relatively seldom used, multivariate linear regression3 is actually very straightforward in some programming environments such as R. Using the McClelland data, let’s try it for ourselves. First, let’s look at the data to get a sense of things.

                vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
attention4         1 430 17.93 3.05     18   18.03 2.97   9  25    16 -0.31    -0.17 0.15
math21             2 364 11.21 2.69     11   11.30 2.97   3  17    14 -0.26    -0.18 0.14
college            3 286  0.37 0.48      0    0.34 0.00   0   1     1  0.52    -1.74 0.03
vocab4             4 386 10.18 2.53     10   10.23 2.97   4  17    13 -0.18    -0.37 0.13
math7              5 397 10.73 2.76     11   10.69 2.97   4  19    15  0.13    -0.47 0.14
read7              6 390 31.57 8.05     30   30.87 8.90  18  61    43  0.79     0.41 0.41
read21             7 360 73.67 8.52     76   74.69 7.41  35  84    49 -1.60     3.88 0.45
adopted            8 430  0.49 0.50      0    0.48 0.00   0   1     1  0.06    -2.00 0.02
male               9 430  0.55 0.50      1    0.56 0.00   0   1     1 -0.19    -1.97 0.02
momed             10 419 14.83 2.03     15   14.77 1.48  10  21    11  0.11    -0.45 0.10
college_missing   11 430  0.33 0.47      0    0.29 0.00   0   1     1  0.70    -1.52 0.02

While these are not the most strongly correlated variables to begin with, one plausible model might try to predict math and reading at age 21 with measures taken at prior years. The coefficients in the output with ~1 are the intercepts.

model = "
  read21 ~ attention4 + vocab4 + read7
  math21 ~ attention4 + math7
  read21 ~~ 0*math21
"
mvregModel  = sem(model, data=mcclelland, missing='listwise', meanstructure = T)
coef(mvregModel)
read21~attention4     read21~vocab4      read21~read7 math21~attention4      math21~math7    read21~~read21 
            0.128             0.377             0.537             0.091             0.347            51.837 
   math21~~math21          read21~1          math21~1 
            5.965            50.290             5.856 

The last line of the model code clarifies that we are treating math21 and read21 as independent. We can compare this to standard R regression. A first step is taken to make the data equal to what was used in lavaan. For that we can use the dplyr package to select the necessary variables for the model, and then omit rows that have any missing.

library(dplyr)
mcclellandComplete = select(mcclelland, read21, math21, attention4, vocab4, read7, math7) %>% 
  na.omit
lm(read21 ~ attention4 + vocab4 + read7, data=mcclellandComplete)

Call:
lm(formula = read21 ~ attention4 + vocab4 + read7, data = mcclellandComplete)

Coefficients:
(Intercept)   attention4       vocab4        read7  
    50.2904       0.1275       0.3770       0.5368  
lm(math21 ~ attention4 + math7, data=mcclellandComplete)

Call:
lm(formula = math21 ~ attention4 + math7, data = mcclellandComplete)

Coefficients:
(Intercept)   attention4        math7  
    5.85633      0.09103      0.34732  

However, we can and probably should estimate the covariance of math and reading skill at age 21. Let’s rerun the path analysis removing that constraint.

model = "
  read21 ~ attention4 + vocab4 + read7
  math21 ~ attention4 + math7
"
mvregModel  = sem(model, data=mcclelland, missing='listwise', meanstructure = T)
coef(mvregModel)
read21~attention4     read21~vocab4      read21~read7 math21~attention4      math21~math7    read21~~read21 
            0.140             0.388             0.494             0.092             0.330            51.958 
   math21~~math21    read21~~math21          read21~1          math21~1 
            5.968             3.202            51.316             6.020 

We can see now that the coefficients are now slightly different from the SLiM approach. The read21~~math21 value represents the residual covariance between math and reading at age 21, i.e. after accounting for the other covariate relationships modeled, it tells us how correlated those skills are. Using summary will show it to be statistically significant (not shown here).

summary(mvregModel)

Whether or not to take a multivariate/path-analytic approach vs. separate regressions is left to the researcher. Assuming multivariate normality is perhaps less tenable than using the assumption for a single variable, and it’s easy to conduct univariate models. But as the above shows, it doesn’t take much to take into account correlated target variables.

Indirect Effects

So path analysis allows for multiple target variables, with the same or a mix of covariates for each target. What about indirect effects? Normal regression models examine direct effects only, and the regression coefficients reflect that direct effect. However, perhaps we think a particular covariate causes some change in another, which then causes some change in the target variable. This is especially true when some measures are collected at different time points. Note that in SEM, any variable in which an arrow is pointing to it in the graphical depiction is often called an endogenous variable, while those that only have arrows going out from them are exogenous. Exogenous variables may still have (unanalyzed) correlations among them. As we will see later, both observed and latent variables may be endogenous or exogenous.

Consider the following model.

Here we posit attention span and vocabulary at age 4 as indicative of what to expect for reading skill at age 7, and that is ultimately seen as a precursor to adult reading ability. In this model, attention span and vocabulary at 4 only have an indirect effect on adult reading ability through earlier reading skill. At least temporally it makes sense, so let’s code this up.

model = "
  read21 ~ read7
  read7 ~ attention4 + vocab4
"

mediationModel  = sem(model, data=mcclelland)
summary(mediationModel, rsquare=TRUE)
lavaan (0.5-20) converged normally after  21 iterations

                                                  Used       Total
  Number of observations                           305         430

  Estimator                                         ML
  Minimum Function Test Statistic                6.513
  Degrees of freedom                                 2
  P-value (Chi-square)                           0.039

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  Z-value  P(>|z|)
  read21 ~                                            
    read7             0.559    0.050   11.152    0.000
  read7 ~                                             
    attention4        0.270    0.151    1.791    0.073
    vocab4            0.488    0.186    2.629    0.009

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    read21           53.047    4.296   12.349    0.000
    read7            66.779    5.408   12.349    0.000

R-Square:
                   Estimate
    read21            0.290
    read7             0.035

What does this tell us? As before, we interpret the results as we would any other regression model, though conceptually there are two sets of models to consider (though they are estimated simultaneously4), one for reading at age 7 and one for reading at age 21. And indeed, one can think of path analysis as a series of linked regression models. Here we have positive relationships between attention and vocab on reading at age 7, and a positive effect of reading at age 7 on reading at age 21. Statistically speaking, our model appears to be viable, as there appear to be statistically significant estimates (or nearly so) for each path.

However, look at the R2 value for reading at age 7. We now see that there are actually no practical effects of the age 4 variables at all, as all we are accounting for is < 4% of the variance, and all that we have really discovered is that prior reading ability affects later reading ability.

We can test the indirect effect itself by labeling the paths. In the following code, I label them based on the first letter of the variables involved (e.g. vr refers to the vocab to reading path), but note that these are arbitrary names. I also add the direct effects of the early age variable. While the indirect effect for vocab is statistically significant, as we already know there is not a strong correlation between these two variables, it’s is largely driven by the strong relationship between reading at age 7 and reading at age 21, which is probably not all that interesting. A comparison of AIC values, something we’ll talk more about later, would favor a model with only direct effects5.

model = "
  read21 ~ rr*read7 + attention4 + vocab4
  read7 ~ ar*attention4 + vr*vocab4
  
  # Indirect effects
  att4_read21 := ar*rr
  vocab4_read21 := vr*rr
"

mediationModel  = sem(model, data=mcclelland)
summary(mediationModel, rsquare=TRUE, fit=T, std=T)
lavaan (0.5-20) converged normally after  28 iterations

                                                  Used       Total
  Number of observations                           305         430

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0

Model test baseline model:

  Minimum Function Test Statistic              121.544
  Degrees of freedom                                 5
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3602.058
  Loglikelihood unrestricted model (H1)      -3602.058

  Number of free parameters                          7
  Akaike (AIC)                                7218.115
  Bayesian (BIC)                              7244.157
  Sample-size adjusted Bayesian (BIC)         7221.957

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent Confidence Interval          0.000  0.000
  P-value RMSEA <= 0.05                          1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
  read21 ~                                                              
    read7     (rr)    0.536    0.050   10.607    0.000    0.536    0.515
    attentin4         0.134    0.134    0.998    0.318    0.134    0.015
    vocab4            0.381    0.166    2.299    0.021    0.381    0.044
  read7 ~                                                               
    attentin4 (ar)    0.270    0.151    1.791    0.073    0.270    0.033
    vocab4    (vr)    0.488    0.186    2.629    0.009    0.488    0.059

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
    read21           51.926    4.205   12.349    0.000   51.926    0.695
    read7            66.779    5.408   12.349    0.000   66.779    0.965

R-Square:
                   Estimate
    read21            0.305
    read7             0.035

Defined Parameters:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
    att4_read21       0.145    0.082    1.766    0.077    0.145    0.017
    vocab4_read21     0.261    0.102    2.552    0.011    0.261    0.030

In the original article, I did not find their description or diagrams of the models detailed enough to know precisely what the model was in the actual study6, but here is at least one interpretation if you’d like to examine it further.

modReading = "
  read21 ~ read7 + attention4 + vocab4 + male + adopted + momed
  read7 ~ attention4 + vocab4 + male + adopted + momed 
"
reading  = sem(modReading, data=mcclelland, missing='fiml', mimic = 'Mplus', std.ov=TRUE)
summary(reading, rsquare=TRUE)

A note about terminology: some refer to models with indirect effects as mediation models, and that terminology appears commonly in the SEM (and esp. psychology) literature (along with the notion of ‘buffering’). Many applied researchers starting out with SEM often confuse the term with moderation, which is called an interaction in every other modeling context. As you start out, referring to indirect effects and interactions will likely keep you clear on what you’re modeling, and perhaps be clearer to those who may not be as familiar with SEM. Also, in moderation models, one will often see some variable denoted as ‘the moderator’, but this is completely arbitrary. In an interaction, it makes just as much sense to say that the A-Y relationship varies as a function of B as it does the B-Y relationship varies as a function of A.

Cavets about indirect effects

One should think very hard about positing an indirect effect, especially if there is no time component or obvious causal path. If the effect isn’t immediately obvious, then one should probably just examine the direct effects. Unlike other standard explorations one might do with models (e.g. look for nonlinear relationships), the underlying causal connection is more explicit in this context. Many models I’ve seen in consulting struck me as arbitrary as far as which covariate served as the mediator, required a fairly convoluted explanation for its justification, or ignored other relevant variables because the reasoning would have to include a plethora of indirect effects if it were to be logically consistent. Furthermore, I can often ask one or two questions and will discover that researchers are actually interested in interactions (i.e. moderation), rather than indirect effects.

This document will not get into models that have moderated mediation and mediated moderation. In my experience these are often associated with models that are difficult to interpret at best, or are otherwise not grounded very strongly in substantive concerns. However, there are times, e.g. in experimental settings, which surprisingly few SEM are applied to, where it would be very much appropriate. It is left to the reader to investigate those particular complications when the time comes.

Bayesian Networks

In many cases of path analysis, the path model is not strongly supported by prior research or intuition, and people are also often willing to use modification indices after the fact to change the paths in their model. This is unfortunate, as their model is generally overfit to begin with, and more so if altered in such an ad hoc fashion.

A more exploratory approach to graphical modeling is available however. Bayesian Networks are an alternative to graphical modeling of the sort we’ve been doing. Though they can be used to produce exactly the same results that we obtain with path analysis via maximum likelihood estimation, they can also be used for constrained or wholly exploratory endeavors as well, with regularization in place to keep from overfitting.

As an example, I use the McClelland data to explore potential paths via the bnlearn package. I make the constraints that variables later in time do not effect variables earlier in time, no variables are directed toward background characteristics like sex, and at least for these purposes I keep math and reading at a particular time from having paths to each other. I show some of the so-called blacklist of constraints, and use the bnlearn package for the model.

    from         to
1 read21      read7
2 read21      math7
3 read21 attention4
4 read21     vocab4
5 read21    adopted
6 read21       male
library(bnlearn)
model = gs(mcclellandNoCollege, blacklist = blacklist, test='mi-g-sh')
# plot(model)

The plot of the model results shows that attention span at age 4 has no useful relationship to the other variables, something we’d already suspected based on previous models, and even could guess at the outset given its low correlations. As it has no connections, I’ve dropped it from the visualization. Furthermore, the remaining paths make conceptual sense. The parameters, fitted values, and residuals can be extracted with the bn.fit function, and other diagnostic plots, cross-validation and prediction on new data are also available.

We won’t get into the details of these models except to say that one should have them in their tool box. And if one really is in a more exploratory situation, the tools available would typically come with methods far better suited for it than the SEM software approach. The discovery process with Bayesian networks can also be a lot of fun. Even if one has strong theory, nature is always more clever than we are, and you might find something interesting.

Undirected Graphs

So far we have been discussing directed graphs in which the implied causal flow tends toward one direction and there are no feedback loops. However, sometimes the goal is not so much to estimate the paths as it is to find the structure. Undirected graphs simply specify the relations of nodes with edges, but without any directed arrows regarding the relationship.

While we could have used the bnlearn package for an undirected graph by adding the argument undirected = T, there are a slew of techniques available for what is often called network analysis. Often the focus is on observations, rather than variables, and what influences whether one sees a tie or not, with modeling techniques available for predicting ties (e.g. Exponential Random Graph models). Often these are undirected graphs and that is our focus here, but they do not have to be.

Network analysis

Networks can be seen everywhere. Personal relationships, machines and devices, various business and academic units… we can analyze the connections among any number of things. A starting point for a very common form of network analysis is an adjacency matrix, which represents connections among items we wish to analyze. Often it is just binary 0-1 values where 1 represents a connection. Any similarity matrix could potentially be used (e.g. a correlation matrix). Here is a simple example of an adjacency matrix:

  Bernadette David Josh Lagia Mancel Nancy
Bernadette 1 0 1 1 1 1
David 0 1 1 0 0 1
Josh 1 1 1 0 1 0
Lagia 1 0 0 1 0 1
Mancel 1 0 1 0 1 0
Nancy 1 1 0 1 0 1

Visually, we can see the connections among the nodes.

As an example of a network analysis, let’s look at how states might be more or less similar on a few variables. We’ll use the state.x77 data set in base R. It is readily available, no need for loading. To practice your R skills, use the function str on the state.x77 object to examine its structure, and head to see the first 6 rows, and ? to find out more about it.

Here are the correlations of the variables.

The following depicts a graph of the states based on the variables of Life Expectancy, Median Income, High School Graduation Rate, and Illiteracy. The colors represent the results of a community detection algorithm, and serves to show the clustering is not merely geographical, though one cluster is clearly geographically oriented (‘the South’).

Understanding Networks

Networks can be investigated in an exploratory fashion or lead to more serious modeling approaches. The following is a brief list of common statistics or modeling techniques.

Centrality
  • Degree: how many links a node has (can also be ‘indegree’ or ‘outdegree’ for directed graphs)
  • Closeness: how close a node is to other nodes
  • Betweenness: how often a node serves as a bridge between the shortest path between two other nodes
  • PageRank: From Google, a measure of node ‘importance’
  • Hub: a measure of the value of a node’s links
  • Authority: another measure of node importance

Characteristics of the network as a whole may also be examined, e.g. degree distribution, ‘clusteriness’, average path length etc.

Cohesion

Investigate how network members create communities and cliques. This is similar to cluster analysis used in other situations. Some nodes may be isolated.

Modeling
  • ERGM: exponential random graph models, regression modeling for network data
  • Other link analysis
Comparison

A goal might be to compare multiple networks to see if they differ in significant ways.

Dynamics

While many networks are ‘static’, many others change over time. One might be interested in this structural change by itself, or modeling something like link loss.

Nonrecursive Models

Recursive models have all unidirectional causal effects and disturbances are not correlated. A model is considered nonrecursive if there is a reciprocal relationship, feedback loop, or correlated disturbance in the model7. Nonrecursive models are potentially problematic when there is not enough information to estimate this model (unidentified model).

A classic example of a nonrecursive relationship is marital satisfaction: the more satisfied one partner is, the more satisfied the other, and vice versa. This can be represented by a simple model (below).

Such models are notoriously difficult to specify in terms of identification, which we will talk more about later. For now, we can simply say the above model would not even be estimated as there are more parameters to estimate (two paths, two variances) than there is information in the data (two variances and one covariance).

To make this model identified, one approach is to use what are called instrumental variables. Instrumental variables directly influence one of the variables in a recursive relationship, but not the other. For example, a wife’s education can influence her satisfaction directly and a husband’s education can influence his satisfaction directly, but a husband’s education cannot directly impact a wife’s satisfaction and vice versa (at least for this demonstration). These instrumental variables can indirectly impact a spouses’ satisfaction (below). The dashed line represents an unanalyzed correlation.

Many instances of nonrecursive models might better be represented by a correlation. One must have a very strong theoretical motivation for such models, which is probably why they aren’t seen as often in the SEM literature, though they are actually quite common in some areas such as economics.

Summary

Path analysis in SEM is a form of theoretically motivated graphical model involving only observed variables. They might include indirect effects and multiple outcomes of interest. However, path analysis is a special case of a more broad set of graphical modeling tools widely used in many disciplines, any of which might be useful for a particular data situation.

R packages used

  • lavaan
  • bnlearn
  • network
  • mediation (not used but one to note)

Latent Variables

Not everything we want to measure comes with an obvious yardstick. If one wants to measure something like a person’s happiness, what would they have at their disposal?

  • Are they smiling?
  • Did they just get a pay raise?
  • Are they interacting well with others?
  • Are they relatively healthy?

Any of these might be useful as an indicator of their current state of happiness, but of course none of them would tell us whether they truly are happy or not. At best, they can be considered imperfect measures. If we consider those and other indicators collectively, perhaps we can get an underlying measure of something we might call happiness, contentment, or some other arbitrary but descriptive name.

Despite how they are typically used within psychology, educational and related fields, the use of latent variable models are actually seen all over, and in ways that may have little to do with what we will be focusing on in this course. Broadly speaking, factor analysis can be seen as a dimension reduction technique, or as an approach to modeling measurement error and understanding underlying constructs. We will give some description of the former while focusing on the latter.

Dimension Reduction/Compression

Many times we simply have the goal of taking a whole lot of variables, reducing them to much fewer, but while retaining as much information about the originals as possible. For example, this is an extremely common goal in areas of image and audio compression. Statistical techniques amenable to these approaches are commonly referred to as matrix factorization.

Principal Components Analysis

Probably the most commonly used factor-analytic technique is principal components analysis (PCA). It seeks to extract components from a set of variables, with each component containing as much of the original variance as possible. Components can be seen as a linear combination of the original variables.

PCA works on a covariance/correlation matrix, and it will return as many components as there are variables that go into it, each subsequent component accounting for less variance than the previous component, and summing up to 100% of the total variance in the original data. With appropriate steps, the components can completely reproduce the original correlation matrix. However, as the goal is dimension reduction, we only want to retain some of these components, and so the reproduced matrix will not be exact. This however gives us some sense of a goal to shoot for, and the same idea is also employed in factor analysis/SEM, where we also work with the covariance matrix and prefer models that can closely reproduce the original matrix seen with the observed data.

Visually, we can display PCA as a graphical model. Here is one with four components/variables. The size of the components represents the amount of variance each accounts for.

Let’s see an example. The following regards a correlation matrix8 of 24 psychological tests given to 145 seventh and eight-grade children in a Chicago suburb by Holzinger and Swineford. We’ll use the psych package, and to use the principal function, we provide the data (available via the psych package), specify the number of components/factors we want to retain, and other options (in this case, the rotated solution will be a little more interpretable9). We will use the psych package here as it gives us a little more output than standard PCA packages and functions, and one that is more consistent with the factor analysis technique we’ll spend time with later. While we will use lavaan for factor analysis to be consistent with the SEM approach, the psych package is a great tool for standard factor analysis, assessing reliability and other fun stuff.

library(psych)
pc = principal(Harman74.cor$cov, nfactors=4,  rotate='varimax')
pc
Principal Components Analysis
Call: principal(r = Harman74.cor$cov, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                         RC1   RC3   RC2  RC4   h2   u2 com
VisualPerception        0.16  0.71  0.23 0.14 0.60 0.40 1.4
Cubes                   0.09  0.59  0.08 0.03 0.37 0.63 1.1
PaperFormBoard          0.14  0.66 -0.04 0.11 0.47 0.53 1.2
Flags                   0.25  0.62  0.09 0.03 0.45 0.55 1.4
GeneralInformation      0.79  0.15  0.22 0.11 0.70 0.30 1.3
PargraphComprehension   0.81  0.18  0.07 0.21 0.73 0.27 1.2
SentenceCompletion      0.85  0.15  0.15 0.06 0.77 0.23 1.1
WordClassification      0.64  0.31  0.24 0.11 0.57 0.43 1.8
WordMeaning             0.84  0.16  0.06 0.19 0.78 0.22 1.2
Addition                0.18 -0.13  0.83 0.12 0.76 0.24 1.2
Code                    0.18  0.05  0.63 0.37 0.57 0.43 1.8
CountingDots            0.02  0.17  0.80 0.05 0.67 0.33 1.1
StraightCurvedCapitals  0.18  0.41  0.62 0.03 0.59 0.41 1.9
WordRecognition         0.23 -0.01  0.06 0.68 0.52 0.48 1.2
NumberRecognition       0.12  0.08  0.05 0.67 0.48 0.52 1.1
FigureRecognition       0.06  0.46  0.05 0.58 0.55 0.45 1.9
ObjectNumber            0.14  0.01  0.24 0.68 0.54 0.46 1.4
NumberFigure           -0.02  0.32  0.40 0.50 0.51 0.49 2.7
FigureWord              0.14  0.25  0.20 0.42 0.30 0.70 2.4
Deduction               0.43  0.43  0.09 0.30 0.47 0.53 2.8
NumericalPuzzles        0.18  0.42  0.50 0.17 0.49 0.51 2.5
ProblemReasoning        0.42  0.41  0.13 0.29 0.45 0.55 3.0
SeriesCompletion        0.42  0.52  0.25 0.20 0.55 0.45 2.7
ArithmeticProblems      0.40  0.14  0.55 0.26 0.55 0.45 2.5

                       RC1  RC3  RC2  RC4
SS loadings           4.16 3.31 3.22 2.74
Proportion Var        0.17 0.14 0.13 0.11
Cumulative Var        0.17 0.31 0.45 0.56
Proportion Explained  0.31 0.25 0.24 0.20
Cumulative Proportion 0.31 0.56 0.80 1.00

Mean item complexity =  1.7
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.06 

Fit based upon off diagonal values = 0.97

First focus on the last portion of the output where it says SS loadings . The first line is the sum of the squared loadings for each component (in this case where we are using a correlation matrix, summing across all 24 components would equal the value of 24). The Proportion Var tells us how much of the overall variance the component accounts for out of all the variables (e.g. 4.16 / 24 = 0.17). The Cumulative Var tells us that all 4 components make up over half the variance. The others are the same thing just based on the four components rather than all 24 variables. We can see that each component accounts for a decreasing amount of variance.

Loadings in this scenario represent the estimated correlation of an item with its component, and provide the key way in which we interpret the factors. However, we have a lot of them, and rather than interpret that mess in our output, we’ll look at it visually. In the following plot, stronger loadings are indicated by blue, and we can see the different variables associated with different components.

Interpretation is the fun but commonly difficult part. As an example, we can see PC2 as indicative of mathematical ability, but in general this isn’t the cleanest result.

Some explanation of the other parts of the output:

  • h2: the amount of variance in the item explained by the (retained) components. It is the sum of the squared loadings (a.k.a. communality).
  • u2: 1 - h2
  • com: A measure of complexity. A value of 1 might be seen for something that loaded on only one component, and zero otherwise (a.k.a. perfect simple structure)

We can get a quick graphical model display as follows:

## fa.diagram(pc, cex=.5)

PCA is probably not the best choice in this scenario, nor likely, is a 4 factor solution. One of the primary reasons is that this graphical model assumes the observed variables are measured without error. In addition, the principal components do not correlate with one another, but it seems likely that we would want to allow the latent variables to do so (a different rotation would allow this). However, if our goal is merely to reduce the 24 items to a few that account for the most variance, this would be a standard technique.

Other Matrix Factorization Techniques

Before leaving PCA for factor analysis of the sort we’ll mostly be concerned with, I’ll mention other matrix factorization techniques that might be of use depending on your data situation.

  • SVD: singular value decomposition. Works on a raw data matrix rather than covariance matrix, and is still a very viable technique that may perform better in a lot of situations relative to fancier latent variable models, and other more recently developed techniques. Variations on SVD are behind some recommender systems of the sort you come across at Amazon, Netflix etc.
  • ICA: Independent components analysis. Extracts non-normal, independent components. The primary goal is to create independent latent variables.
  • Generalized PCA: PCA techniques for different types of data, e.g. binary data situations.
  • PC Regression: combining PCA with regression in one model.
  • NMF: non-negative matrix factorization. Applied to positive valued matrices, produces positive valued factors. Useful, for example, when dealing with counts.
  • LSI: Latent Semantic Indexing, an early form of topic modeling.
  • Many others.

Factor Analysis

Factor analysis is a general technique for uncovering latent variables within data. While initially one might think it similar to PCA10, one difference from PCA is that the goal is not to recover maximum variance with each component. However, we will move beyond factor analysis as a dimension reduction technique (and fully ‘exploratory’ technique, see below), and instead present it as an approach with a strong theoretical underpinning, and one that can help us assess measurement error, ultimately even leading to regression models utilizing the latent variables themselves.

So let us turn to what are typically called measurement models within SEM. The underlying model can be thought of as a case in which the observed variables, in some disciplines referred to as indicators of the latent construct (also manifest variables), are caused by the latent variable. The degree to which the observed variables correlate with one another depends in part on how much of the underlying (presumed) latent variable they reliably measure.

For each indicator we can think of a regression model as follows, where \(\beta_0\) is the intercept and \(\lambda\) the regression coefficient that expresses the effect of the latent variable \(F\) on the observed variable \(X\) (I do not note the error).

\[X = \beta_0 + \lambda F\]

We will almost always have multiple indicators, and often multiple latent variables. Some indicators may be associated with multiple factors.

\[\begin{aligned} X_1 &= \beta_{01} + \lambda_{11} F_1 + \lambda_{21} F_2 \\ X_2 &= \beta_{02} + \lambda_{12} F_1 + \lambda_{22} F_2 \\ X_3 &= \beta_{03} + \lambda_{13} F_1 \end{aligned}\]

It is important to understand this regression model, because many who engage in factor analysis seemingly do not, and often think of it the other way around, where the observed variables cause the latent. In factor analysis, the \(\lambda\) coefficients are called loadings (as they were in PCA), but are interpreted as any other regression coefficient- a one unit change in the latent variable results in a \(\lambda\) change in the observed variable. Most factor models assume that, controlling for the latent variable, the observed variables are independent (recall our previous discussion on conditional independence in graphical models), though this is sometimes relaxed.

Exploratory vs. Confirmatory

An unfortunate and unhelpful distinction in some disciplines is that of exploratory vs. confirmatory factor analysis (and even exploratory SEM). In any regression analysis, there is a non-zero correlation between any variable and some target variable. We don’t include everything for theoretical (and even practical) reasons, which is akin to fixing its coefficient to zero, and here it is no different. Furthermore, most modeling endeavors could be considered exploratory, regardless of how the model is specified. As such, this distinction doesn’t tell us anything about the model, and is thus unnecessary in my opinion.

As an example, \(X_3\) is not modeled by \(F_2\), which is the same as fixing the \(\lambda_{23}\) coefficient for \(F2\) to 0, but that doesn’t tell me whether the model is exploratory or not, and yet that is all the distinction refers to, namely, whether we let all factors load with all indicators or not. An analysis doesn’t suddenly have more theoretical weight, validity etc. due to the paths specified.

Example

Let’s see a factor analysis in action. The motivating example for this section comes from the National Longitudinal Survey of Youth (1997, NLSY97), which investigates the transition from youth to adulthood. For this example, we will investigate a series of questions asked to the participants in 2006 pertaining to the government’s role in promoting well-being. Questions regarded the government’s responsibility for the following: providing jobs for everyone, keeping prices under control, providing health care, providing for elderly, helping industry, providing for unemployed, reducing income differences, providing college financial aid, providing decent housing, protecting the environment. Each item has four values 1:4, which range from ‘definitely should be’ to ‘definitely should not be’11. We’ll save this for the exercise.

There are also three items regarding their emotional well-being (depression)- how often the person felt down or blue, how often they’ve been a happy person, and how often they’ve been depressed in the last month. These are also four point scales and range from ‘all of the time’ to ‘none of the time’. We’ll use this here.

depressed = read.csv('data/nlsy97_depressedNumeric.csv')

library(lavaan)
modelCode = "
  depressed =~ FeltDown + BeenHappy + DepressedLastMonth
"
famod = cfa(modelCode, data=depressed)
summary(famod, standardized=T)
lavaan (0.5-20) converged normally after  19 iterations

                                                  Used       Total
  Number of observations                          7183        8985

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
  depressed =~                                                          
    FeltDown          1.000                               0.541    0.813
    BeenHappy        -0.732    0.020  -37.329    0.000   -0.396   -0.609
    DeprssdLstMnth    0.719    0.019   37.992    0.000    0.388    0.655

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
    FeltDown          0.150    0.007   20.853    0.000    0.150    0.339
    BeenHappy         0.266    0.006   46.489    0.000    0.266    0.629
    DeprssdLstMnth    0.201    0.005   41.606    0.000    0.201    0.571
    depressed         0.292    0.010   30.221    0.000    1.000    1.000

Raw results

In a standard measurement model such as this we must scale the factor by fixing one of the indicator’s loadings to one. This is done for identification purposes, so that we can estimate the latent variable variance. Which variable is selected for scaling is arbitrary, but doing so means that the sum of the latent variable variance and the residual variance of the variable whose loading is fixed to one equals the variance of that observed variable12.

var(depressed$FeltDown, na.rm=T)  # .29 + .15
[1] 0.441856

Standardized latent variable

An alternative way to scale the latent variable is to simply fix its variance to one (the std.lv=TRUE results). It does not need to be estimated, allowing us to obtain loadings for each observed variable.

Standardized latent and observed

With both standardized (using the summary function, set standardized=T), these loadings represent correlations between the observed and latent variables. This is the default output in the factor analysis we’d get from non-SEM software (i.e. ‘exploratory’ FA). If one is just doing a factor-analytic model, these loadings are typically reported. Standardized coefficients in a CFA are computed by taking the unstandardized coefficient (loading) and multiplying it by the model implied standard deviation of the indicator then dividing by the latent variable’s standard deviation. Otherwise, one can simply use standaridized variables in the analysis, or supply only the correlation matrix.

If you’d like to peel back the curtain and see maximum likelihood estimation based on the raw (covariance) and standardized (correlation) scales, with a comparison to lavaan output, click here.

Constructs and Measurement models

Scale development

A good use of factor analysis regards scale development. If we come up with 10 items that reflect some underlying construct, factor analysis can provide a sense of how well the scale is put together. Recall that in path analysis, residual variance, sometimes called disturbance, reflects both omitted causes as well as measurement error. In this context, \(1-R^2_{item}\) provides a sense of how unreliable the item is. A perfectly reliable item would have a value of zero. Strong loadings indicate a strong relationship between the item and the underlying construct.

Factor Scores

In factor analysis, we can obtain estimated factor scores for each observation, possibly to be used in additional analyses or examined in their own right. One common way is to simply use the loadings as one would regression weights/coefficients (actually scaled versions of them), and create a ‘predicted’ factor score as the linear combination of the indicator variables, just as we would in regression.

vs. Means/Sums

In many occasions, people reduce the number of variables in a model by using a mean or sum score. These actually can be seen to reflect an underlying factor analysis where all loadings are fixed to be equal and the residual variance of the observed variables is fixed to zero, i.e. perfect measurement. If you really think the items reflect a particular construct, you’d probably be better off using a score that comes from a model that doesn’t assume perfect measurement.

vs. Composites

Composites scores are what we’d have if we turned the arrows around, and allowed different weights for the different variables, which may not be similar too similar in nature or necessarily correlated (e.g. think of how one might construct a measure of socioeconomic status). Unlike a simple mean, these would have different weights associated with the items. PCA is one way one could create such a composite.

Some Other Uses of Latent Variables

  • EM algorithm: A very common technique to estimate model parameters for a variety of model situations, it incorporates a latent variable approach where parameters of interest are treated as a latent variable (e.g. probability of belonging to some cluster).

  • Item Response Theory: uses latent variables, especially in test situations (though is much broader), to assess things like item discrimination, student ability etc.

  • Hidden Markov Model: A latent variable model approach commonly used for time series.

  • Topic Model: In the analysis of text, one can discover latent ‘topics’ based on the frequency of words.

  • Collaborative Filtering: For example, in recommender systems for movies or music, the latent variable might represent genre.

Summary

Latent variable approaches are a necessary tool to have in your statistical toolbox. Whether your goal is to compress data or explore underlying constructs, ‘factor-analysis’ will serve you well.

R packages used

  • psych
  • lavaan

Structural Equation Modeling

Structural equation modeling combines the path analytic and latent variable techniques together to allow for regression models among latent variables. Any model, even the SLiM, can be seen as some form of SEM. However, the term is typically reserved for the combination of latent and observed variables in a model.

Measurement model

The measurement model refers to the latent variable models, i.e. factor analysis, and typical practice in SEM is to investigate these separately and first. The reason is that one wants to make sure that the measurement model holds before going any further with the underlying constructs. For example, for one’s sample of data one might detect two latent variables work better for a set of indicators, or might find that some indicators are performing poorly.

Structural model

The structural model specifies the relations among latent and observed variables that do not serve as indicators. It can become quite complex, but at this stage one can lean on what they were exposed to with path analysis, as conceptually we’re in the same place, except now some variables may be latent.

Example

The following model is a classic example from Wheaton et al. (1977), which used longitudinal data to develop a model of the stability of alienation from 1967 to 1971, accounting for socioeconomic status as a covariate. Each of the three factors have two indicator variables, SES in 1966 is measured by education and occupational status in 1966 and alienation in both years is measured by powerlessness and anomia. The structural component of the model hypothesizes that SES in 1966 influences both alienation in 1967 and 1971 and alienation in 1967 influences the same measure in 1971. We also let the disturbances correlate from one timepoint to the next.

wheaton.model <- '
  # measurement model
    ses     =~ education + sei
    alien67 =~ anomia67 + powerless67
    alien71 =~ anomia71 + powerless71
 
  # structural model
    alien71 ~ aa*alien67 + ses
    alien67 ~ sa*ses
 
  # correlated residuals
    anomia67 ~~ anomia71
    powerless67 ~~ powerless71

  # Indirect effect
    IndirectEffect := sa*aa
'

The standardized results of the structural model are shown below, and model results below that. In this case, the structural paths are statistically significant, as is the indirect effect specifically. Higher socioeconomic status is affiliated with less alienation, while there is a notable positive relationship of prior alienation with later alienation. We are also accounting for roughly 50% of the variance in 1971 alienation. Colors represent positive vs. negative weights, and the closer to zero the more faded they are.

lavaan (0.5-20) converged normally after  73 iterations

  Number of observations                           932

  Estimator                                         ML
  Minimum Function Test Statistic                4.735
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.316

Model test baseline model:

  Minimum Function Test Statistic             2133.722
  Degrees of freedom                                15
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       0.999

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -15213.274
  Loglikelihood unrestricted model (H1)     -15210.906

  Number of free parameters                         17
  Akaike (AIC)                               30460.548
  Bayesian (BIC)                             30542.783
  Sample-size adjusted Bayesian (BIC)        30488.792

Root Mean Square Error of Approximation:

  RMSEA                                          0.014
  90 Percent Confidence Interval          0.000  0.053
  P-value RMSEA <= 0.05                          0.930

Standardized Root Mean Square Residual:

  SRMR                                           0.007

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  ses =~                                              
    education         1.000                           
    sei               5.219    0.422   12.364    0.000
  alien67 =~                                          
    anomia67          1.000                           
    powerless67       0.979    0.062   15.895    0.000
  alien71 =~                                          
    anomia71          1.000                           
    powerless71       0.922    0.059   15.498    0.000

Regressions:
                   Estimate  Std.Err  Z-value  P(>|z|)
  alien71 ~                                           
    alien67   (aa)    0.607    0.051   11.898    0.000
    ses              -0.227    0.052   -4.334    0.000
  alien67 ~                                           
    ses       (sa)   -0.575    0.056  -10.195    0.000

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  anomia67 ~~                                         
    anomia71          1.623    0.314    5.176    0.000
  powerless67 ~~                                      
    powerless71       0.339    0.261    1.298    0.194

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    education         2.801    0.507    5.525    0.000
    sei             264.597   18.126   14.597    0.000
    anomia67          4.731    0.453   10.441    0.000
    powerless67       2.563    0.403    6.359    0.000
    anomia71          4.399    0.515    8.542    0.000
    powerless71       3.070    0.434    7.070    0.000
    ses               6.798    0.649   10.475    0.000
    alien67           4.841    0.467   10.359    0.000
    alien71           4.083    0.404   10.104    0.000

R-Square:
                   Estimate
    education         0.708
    sei               0.412
    anomia67          0.600
    powerless67       0.726
    anomia71          0.649
    powerless71       0.692
    alien67           0.317
    alien71           0.497

Defined Parameters:
                   Estimate  Std.Err  Z-value  P(>|z|)
    IndirectEffect   -0.349    0.041   -8.538    0.000

Issues in SEM

Identification

Identification generally refers to the problem of finding a unique estimate of the value for each parameter in the model. Consider the following: \[ a + b = 2\]

There is no way for us to determine a unique solution for \(a\) and \(b\), e.g. the values of 1 and 1 work just as well as -1052 and 1054 and infinite other combinations. We can talk about 3 basic scenarios, and the problem generally regards how much information we have (in terms of (co)variances) vs. how many parameters we want to estimate in the model.

  • A model which has an equal number of observations (again, in terms of (co)variances) and parameters to estimate would have zero degrees of freedom, and is known as a just identified model. In a just identified model there are no extra degrees of freedom leftover to test model fit.
  • Underidentified models are models where it is not possible to find a unique estimate for each parameter. These models may have negative degrees of freedom or problematic model structures, as in the example above, and you’ll generally know right away there is a problem as whatever software package will note an error, warning, or not provide output.
  • Overidentified models have positive degrees of freedom, meaning there is more than enough pieces of information to estimate each parameter. It is desirable to have overidentified models as it allows us to use other measures of model fit.

Consider the following CFA example in which we try to estimate a latent variable model with only two observed variables, as would be the case in the prior Alienation measurement models if they are examined in isolation. We have only two variances and one covariance to estimate two paths, the latent variable variance and the two residual variances. By convention, a path is always fixed at 1 to scale the latent variable, but that still leaves us with four parameters to estimate with only three pieces of information, hence the -1 degrees of freedom and other issues in the output.

modelUnder = 'LV =~ x1 + x2'
modelJust = 'LV =~ x1 + x2 + x3'
underModel = cfa(modelUnder, data=cbind(x1,x2,x3))

summary(underModel)
lavaan (0.5-20) converged normally after  10 iterations

  Number of observations                           100

  Estimator                                         ML
  Minimum Function Test Statistic                   NA
  Degrees of freedom                                -1

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  LV =~                                               
    x1                1.000                           
    x2                0.677       NA                  

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    x1                0.409       NA                  
    x2                0.740       NA                  
    LV                0.614       NA                  

Now if we had a third variable, we now have six pieces of information to estimate and still seven unknowns. Again though, we usually fix the first loading to 1, so it would be estimated. An alternative approach would be to fix the factor variance to some value (typically 1 to create a standardized latent variable). This will allow us to estimate a unique value for each path.

Even so, this is the just identified situation, and so the model runs, but we won’t have any fit measures because we can perfectly reproduce the observed correlation matrix.

justModel = cfa(modelJust, data=cbind(x1,x2,x3))

summary(justModel, fit=T)
lavaan (0.5-20) converged normally after  21 iterations

  Number of observations                           100

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

Model test baseline model:

  Minimum Function Test Statistic               90.742
  Degrees of freedom                                 3
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)               -388.626
  Loglikelihood unrestricted model (H1)       -388.626

  Number of free parameters                          6
  Akaike (AIC)                                 789.251
  Bayesian (BIC)                               804.882
  Sample-size adjusted Bayesian (BIC)          785.933

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent Confidence Interval          0.000  0.000
  P-value RMSEA <= 0.05                          1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  LV =~                                               
    x1                1.000                           
    x2                1.412    0.281    5.021    0.000
    x3                1.767    0.384    4.605    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    x1                0.728    0.113    6.427    0.000
    x2                0.434    0.113    3.858    0.000
    x3                0.214    0.151    1.422    0.155
    LV                0.294    0.112    2.627    0.009

Note that in the full alienation model, we have 6*7/2 = 21 variances and covariances, which provides enough to estimate the parameters of the model.

Determining identification is difficult for any complex model. Necessary conditions include there being model degrees of freedom \(\geq 0\), and scaling all latent variables, but they are not sufficient. In general though, it is enough to know conceptually what the issue is and how the information you have relates to what you can estimate.

Fit

There are many, many measures of model fit for SEM, and none of them will give you a definitive answer as to how your model is doing. Your assessment, if you use them, is to take a holistic approach to get a global sense of how your model is doing. Let’s look again at the alienation results.

fitMeasures(alienation, c('chisq', 'df', 'pvalue', 'cfi', 'rmsea', 'srmr', 'AIC'))
    chisq        df    pvalue       cfi     rmsea      srmr       aic 
    4.735     4.000     0.316     1.000     0.014     0.007 30460.548 

Chi-square test

Conceptually the chi-square test measures the discrepancy between the observed correlations and those implied by the model. In the graphical model section, we actually gave an example of reproducing a correlation from a path analysis. In general, the model goal is to reproduce them as closely as we can. It compares the fitted model with a (saturated) model that does not have any degrees of freedom. The degrees of freedom for this test are equal to the data (variances + covariances) minus the number of parameters estimated. A non-significant \(\chi^2\) suggests our predictions are not statistically different from those we observe, so yay!

  Estimator                                         ML
  Minimum Function Test Statistic                4.735
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.316

Or not. Those familiar with null-hypothesis testing know that one cannot accept a null hypothesis, and attempting to do so is fundamentally illogical. Other things that affect this measure specifically include multivariate non-normality, the size of the correlations (larger ones are typically related to larger predicted vs. observed discrepancies), unreliable measures (can actually make this test fail to reject), and sample size (same as with any other model scenario and statistical significance).

So if it worries you that a core measure of model fit in SEM is fundamentally problematic, good. As has been said before, no single measure will be good enough on its own, so gather as much info as you can. Some suggest to pay more attention to the \(\chi^2\) result, but to me, the flawed logic is something that can’t really be overcome. If you use it with appropriate null hypothesis testing logic, a significant \(\chi^2\) test can tell you that something is potentially wrong with the model.

  • Note that lavaan also provides a Chi-square test which compares the current model to a model in which all paths are zero, and is essentially akin to the likelihood ratio test we might use in standard model settings (e.g. comparing against an intercept only model). For that test, we want a statistically significant result. However, one can specify a prior model conducted with lavaan to test against specifically (i.e. to compare nested models).

CFI etc.

The comparative fit index compares the fitted model to a null model that assumes there is no relationship among the measured items. CFI values larger than .9 or especially .95 are desired. Others include the Tucker-Lewis Fit Index, which is provided in standard lavaan output, but there are more incremental fit indices where those come from.

User model versus baseline model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       0.999

RMSEA

The root mean squared error of approximation is a measure that also centers on the model-implied vs. sample covariance matrix, and, all else being equal, is lower for simpler models and larger sample sizes. Look for values less than .05. Lavaan also provides a one-sided test that the RMSE is \(\leq .05\), which ideally would be high, but the confidence interval is enough for reporting purposes.

Root Mean Square Error of Approximation:

  RMSEA                                          0.014
  90 Percent Confidence Interval          0.000  0.053
  P-value RMSEA <= 0.05                          0.930

Standardized Root Mean Square Residual:

  SRMR                                           0.007

SRMR

The standardized root mean squared residual is the mean absolute correlation residual, i.e. the difference between the observed and model-implied correlations. Historical suggestions are to also look for values less than .05, but it is better to simply inspect the residuals and note where there are large discrepancies.

residuals(alienation, type='cor')$cor
            eductn sei    anom67 pwrl67 anom71 pwrl71
education    0.000                                   
sei          0.000  0.000                            
anomia67     0.007 -0.020  0.000                     
powerless67 -0.006  0.018  0.000  0.000              
anomia71     0.007 -0.017  0.000  0.001  0.000       
powerless71 -0.001  0.001 -0.001  0.000  0.000  0.000

Fit Summarized

A brief summary of these and other old/typical measures of fit are described here. However they all have issues, and one should never use cutoffs as a basis for your ultimate thinking about model performance. Studies have been done and all the fit indices have issues in various scenarios, and the cutoffs do not hold up under scrutiny. While they can provide some assistance in the process, they are not meant to overcome a global assessment of theory-result compatibility.

Model Comparison

All of the above, while rooted in model comparison approaches, are by themselves only providing information about the fit or lack thereof regarding the current model. In any modeling situation, SEM or otherwise, a model comparison approach is to be preferred. Even when we don’t have the greatest model, being able to choose among viable options can help science progress.

AIC

AIC is a good way to compare models in SEM just as it is elsewhere, where a penalty is imposed on the likelihood based on the number of parameters estimated, i.e. model complexity. The value by itself is not useful, but the ‘better’ model will have a lower value. A natural question arises as to how low is low enough to prefer one model over another, but this is impossible to answer because the value of AIC varies greatly with the data and model in question. However, this frees you to know which is ‘better’, at least in terms of AIC, while still allowing you to consider the merits of the model even so. If the lower AIC is associated with the simpler model, you’d be hard-pressed to justify not taking it.

BIC

One can probably ignore the BIC in this context. This isn’t actually Bayesian, and if you are using a Bayesian approach, WAIC or DIC would be appropriate. If you aren’t using a Bayesian approach, then AIC would likely be preferable in most circumstances. The BIC has a different penalty than AIC, and is not a measure based on predictive performance, which is what we typically want in model selection.

Example

Let’s compare the previous model to one without the indirect effect and in which the SES and Alienation contributions are independent (i.e. just make the previous code change to alien67 ~ 0*ses). We’ll use the semTools package for easy side by side comparison.

library(semTools)
## compareFit(alienation, alienationNoInd)
################### Nested Model Comparison #########################
                                chi df      p delta.cfi
alienation - alienationNoInd 200.28  1  <.001    0.0941
  chisq df pvalue cfi tli aic bic rmsea srmr
alienation 4.735 4 .316† .000† .999† 30460.548† 30542.783† .014† .007†
alienationNoInd 205.017 5 .000 .906 .717 30658.830 30736.227 .207 .173

The first result is a likelihood ratio test. The model with no path is nested within the model with a path and so this is a viable option. It tells us essentially that adding the indirect path results in a statistically significantly better model. In terms of fit indices, the model including the indirect effect is better. So now we can say that not only does our model appear to fit the data well, but is better than a plausible competitor.

Prediction

While the fitted correlation matrix is nice to be able to obtain, it has always struck me a bit odd that one can’t even predict the actual data with typical SEM software. Part of this is due to the fact that the models regard the covariance matrix as opposed to the raw data, and that is the focus in many SEM situations. But in path analysis, measurement models, and SEM where mean structures are of focus (e.g. growth curves), it stands to reason that one would like to get predicted values and/or be able to test a model on new data. Even in more complex models, predictions can be made by fixing parameters at estimated values and supplying new data.

Lavaan at least does do this for you, and its lavPredict function allows one to get predicted values for both latent and observed variables, for the current or new data. There is also a fit index, obtainable through the fitMeasures function, called Expected Value of Cross-Validation Index (or ECVI), that speaks to this notion. In addition, the semTools package is a great resource for comparing models generally, comparing models across groups, model simulation and so forth.

Observed covariates

Just to be clear, SEM doesn’t only have to be about structural relations among latent variables. At any point observed covariates can be introduced to the structural model as well, and this is common practice. As an example, the alienation model is fundamentally wrong, as it doesn’t include many background or other characteristics we’d commonly collect on individuals and which might influence feelings of alienation.

Interactions

Interactions among both observed and latent variables can be included in SEM, and have the same interpretation as they would in any regression model. A common term for this in SEM is moderation. While, many depictions in SEM suggest that one variable moderates another, just like with standard interactions it is arbitrary whether one says A interacts with/moderates B or vice versa, and this fact doesn’t change just because we are doing SEM. For latent variables, one can think of adding a latent variable whose indicator variables consists of product terms of the indicators for the latent variables we want to have an interaction. See indProd and probe2WayMC in the semTools package.

Estimation

In everything demonstrated thus far, we have been using standard maximum likelihood to estimate the parameters. This often may not be the best choice. The following list comes from the MPlus manual, and most of these are available in lavaan.

  • ML: maximum likelihood parameter estimates with conventional standard errors and chi-square test statistic
  • MLM: maximum likelihood parameter estimates with standard errors and a mean-adjusted chi-square test statistic that are robust to non-normality. The chi-square test statistic is also referred to as the Satorra-Bentler chi-square.
  • MLMV: maximum likelihood parameter estimates with standard errors and a mean- and variance-adjusted chi-square test statistic that are robust to non-normality
  • MLR: maximum likelihood parameter estimates with standard errors and a chi-square test statistic (when applicable) that are robust to non-normality and non-independence of observations when used with TYPE=COMPLEX. The MLR standard errors are computed using a sandwich estimator. The MLR chi-square test statistic is asymptotically equivalent to the Yuan-Bentler T2* test statistic.
  • MLF: maximum likelihood parameter estimates with standard errors approximated by first-order derivatives and a conventional chi-square test statistic
  • MUML: Muthén’s limited information parameter estimates, standard errors, and chi-square test statistic
  • WLS: weighted least square parameter estimates with conventional standard errors and chi-square test statistic that use a full weight matrix. The WLS chi-square test statistic is also referred to as ADF when all outcome variables are continuous.
  • WLSM: weighted least square parameter estimates using a diagonal weight matrix with standard errors and mean-adjusted chi-square test statistic that use a full weight matrix
  • WLSMV: weighted least square parameter estimates using a diagonal weight matrix with standard errors and mean- and variance-adjusted chi-square test statistic that use a full weight matrix
  • ULS: unweighted least squares parameter estimates
  • ULSMV: unweighted least squares parameter estimates with standard errors and a mean- and variance-adjusted chi-square test statistic that use a full weight matrix
  • GLS: generalized least square parameter estimates with conventional standard errors and chi-square test statistic that use a normal-theory based weight matrix
  • Bayes: Bayesian posterior parameter estimates with credibility intervals and posterior predictive checking13

Missing data

A lot of data of interest in applications of SEM have missing values. Two common approaches to dealing with this are Full Information Maximum Likelihood (FIML) and Multiple Imputation (MI), and both are generally available in SEM packages. This is far too detailed an issue to treat adequately here, though we can take a moment to describe the approach generally. FIML uses the available information in the data (think pairwise correlations). MI uses a process to estimate the raw data values, and to adequately account for the uncertainty in those guesses, it creates multiple versions of complete data sets, each with different estimates of the missing values. The SEM model is run on all of them and estimates combined across all models (e.g. the mean path parameter). The imputation models, i.e. those used to estimate the missing values, can be any sort of regression model, including using variables not in the SEM model.

In addition, Bayesian approaches can estimate the missing values as additional parameters in the model (in fact, MI is essentially steeped within a Bayesian perspective). Also there may additional concerns when data is missing over time, i.e. longitudinal dropout. Using the lavaan package is nice because it comes with FIML, and the semTools package adds MI.

Other SEM approaches

SEM is very flexible and applicable to a wide variety of modeling situations. Some of these will be covered in their own module (e.g. mixture models, growth curve modeling).

How to fool yourself with SEM

Kline’s third edition text listed over 50(!) ways in which one could fool themselves with SEM, which speaks to the difficulty in running SEM and dealing with all of its issues. I will note a handful of some of them to keep in mind in particular.

Sample size

If you don’t have at least a thousand observations, you will probably only be able to conduct (possibly unrealistically) simple SEM models, or just the measurement models for scale development, or only structural models with observed variables (path analysis). Even with simpler modeling situations, one should have several hundred observations. In the simple alienation model above, we already are dealing with 17 parameters, and it doesn’t include any background covariates of the individuals, which is unrealistic. Furthermore, because it’s a mediation model, adding such variables might require additional direct and indirect paths, time-varying covariates that would have effects at both years, etc., and the number of parameters could balloon quite quickly.

One will see many articles of published research with low sample sizes using SEM. This doesn’t make it correct to do so, and one should be highly suspicious of the results suggested in those papers as they are overfit or not including relevant information.

Poor data

If the correlations among the data are low, one isn’t going to magically have strong effects by using SEM. I have seen many clients running these models and who are surprised that they don’t turn out well, when a quick glance at the correlation matrix would have suggested that there wasn’t much to work with in the first place.

Naming a latent variable doesn’t mean it exists

While everything may turn out well for one’s measurement model, and the results in keeping with theory, that doesn’t make it so. This is especially so with less reliable measures. Latent constructs require operational definitions and other considerations in order to be useful, and rule out that one isn’t simply measuring something else, or that it makes sense that such a construct has real (physical ties).

As an example, many diagnoses in the Diagnostic and Statistical Manual of Mental Disorders have not even been shown to exist via a statistical approach like SEM14, while others are simply assumed to exist, and even presumably (subsequently) supported by measurement models (often with low N), only to be shown to have no ties to any underlying physiology.

Ignoring diagnostics

Ignoring residuals, warning messages, etc. is a sure path to trying to interpret nonsensical results. Look at your residuals, fitted values etc. If your SEM software of choice is giving you messages, find out what they mean, because it may be very important.

Ignoring performance

As in our previous path analysis example, one can write a paper on a good fitting model with statistically significant results, and not explain any of the target variables. Check things like R-square (and accuracy if binary outputs) when running your models.

Summary

SEM is a powerful modeling approach that generalizes many other techniques, but it simply cannot be used lightly. Strong theory, strong data, and a lot of data can potentially result in quite interesting models that have a lot to say about the underlying constructs of interest. Go into it with competing ideas, and realize that your theory is wrong from the outset, even if there is evidence that it isn’t way off.

R packages used

  • lavaan
  • semTools
  • semPlots

Latent Growth Curves

Latent growth curve (LGC) models are in a sense, just a different form of the very commonly used mixed model framework. In some ways they are more flexible, mostly in the standard structural equation modeling framework that allows for indirect, and other complex covariate relationships. In other ways, they are less flexible (e.g. requiring balanced data, estimating nonlinear relationships, data with many time points, dealing with time-varying covariates). With appropriate tools there is little one can’t do with the normal mixed model approach relative to the SEM approach, and one would likely have easier interpretation. As such I’d recommend sticking with the standard mixed model framework unless you really need to.

That said, growth curve models are a very popular SEM technique, so it makes sense to become familiar with them. To best understand a growth curve model, I still think it’s instructive to see it from the mixed model perspective, where things are mostly interpretable from what you know from a SLiM.

Random effects

Often data is clustered, e.g. students within schools or observations for individuals over time. The standard linear model assumes independent observations, and in these situations we definitely do not have that.

One very popular way to deal with these are a class of models called mixed effects models, or simply mixed models. They are mixed, because there is a mixture of fixed effects and random effects. The fixed effects are the regression coefficients one has in standard modeling approaches.

The random effects allow each cluster to have its own unique effect in addition to the overall fixed effect. This is simply a random deviation, almost always normally distributed in practice, from the overall intercept and slopes. Mixed models are a balanced approach between ignoring these unique contributions, and over-contextualizing by running separate regressions for every cluster.

Random Effects in SEM

As we’ve seen with other models, the latent variables are assumed normally distributed, usually with zero mean, and some estimated variance. Well so are the random effects in mixed models, and it’s through this that we can maybe start to get a sense of random effects as latent variables (or vice versa). Indeed, mixed models have ties to many other kinds of models (e.g. spatial, additive), because they too add a ‘random’ component to the model in some fashion.

Simulating Random Effects

Through simulation we can demonstrate conceptual understanding of mixed models, and be well on our way toward better understanding LGC models. We’ll have balanced data with four time-points for 500 individuals (subjects).

set.seed(1234)
n = 500
timepoints = 4
time = rep(0:3, times=n)
subject = rep(1:n, each=4)

We’ll have ‘fixed’ effects, i.e. our standard regression intercept and slope, set at .5 and .25 respectively. We’ll allow their associated subject-specific effects to have a slight correlation (.2), and as such we’ll draw them from a multivariate normal distribution (variance of 1 for both effects).

intercept = .5
slope = .25
randomEffectsCorr = matrix(c(1,.2,.2,1), ncol=2) 
randomEffects = MASS::mvrnorm(n, mu=c(0,0), Sigma = randomEffectsCorr, empirical=T) %>% 
  data.frame()
colnames(randomEffects) = c('Int', 'Slope')

Let’s take a look at the data thus far. Note how I’m using subject as a row index. This will spread out the n random effects to n*timepoints total, while being constant within a subject.

data.frame(Subject=subject, time=time, randomEffects[subject,]) %>% 
  head(20)
    Subject time        Int      Slope
1         1    0 -1.4774750  0.4536349
1.1       1    1 -1.4774750  0.4536349
1.2       1    2 -1.4774750  0.4536349
1.3       1    3 -1.4774750  0.4536349
2         2    0  0.6390051 -0.9525007
2.1       2    1  0.6390051 -0.9525007
2.2       2    2  0.6390051 -0.9525007
2.3       2    3  0.6390051 -0.9525007
3         3    0  0.7736171  1.1377251
3.1       3    1  0.7736171  1.1377251
3.2       3    2  0.7736171  1.1377251
3.3       3    3  0.7736171  1.1377251
4         4    0 -2.1972780 -1.0066772
4.1       4    1 -2.1972780 -1.0066772
4.2       4    2 -2.1972780 -1.0066772
4.3       4    3 -2.1972780 -1.0066772
5         5    0 -0.1922775  1.8472234
5.1       5    1 -0.1922775  1.8472234
5.2       5    2 -0.1922775  1.8472234
5.3       5    3 -0.1922775  1.8472234

Now, to get a target variable, we simply add the random effects for the intercept to the overall intercept, and likewise for the slopes. We’ll throw in some noise at the end.

sigma = .5
y1 = (intercept + randomEffects$Int[subject]) + (slope + randomEffects$Slope[subject])*time + rnorm(n*timepoints, mean=0, sd=sigma)

d = data.frame(subject, time, y1)
head(d)
  subject time         y1
1       1    0 -1.5801417
2       1    1 -0.1231067
3       1    2 -0.3397778
4       1    3  1.4511151
5       2    0  1.4904810
6       2    1 -0.5164371

Let’s estimate this as a mixed model first15. See if you can match the parameters from our simulated data to the output.

library(lme4)
mixedModel = lmer(y1 ~ time + (1 + time|subject), data=d)  # 1 represents the intercept
## summary(mixedModel)
Linear mixed model fit by REML ['lmerMod']
Formula: y1 ~ time + (1 + time | subject)
   Data: d

REML criterion at convergence: 5833.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.36211 -0.48276  0.02046  0.47515  2.84524 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 subject  (Intercept) 0.9999   0.9999       
          time        0.9898   0.9949   0.21
 Residual             0.2382   0.4881       
Number of obs: 2000, groups:  subject, 500

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.48986    0.04830  10.141
time         0.26400    0.04555   5.796

Our fixed effects are the values we set for the overall intercept and slope. The estimated random effects variances are at 1, the correlation near .2, and finally, our residual standard deviation is near the .5 value we set.

Running a Growth Curve Model

As before, we’ll use lavaan, but now the syntax will look a bit strange compared to what we’re used to, because we have to fix the factor loadings to specific values in order to make it work. This also leads to non-standard output relative to other SEM models, as there is nothing to estimate for the many fixed parameters. Furthermore, our data needs to be in wide format, where each row represents a person, as opposed to the long format we used in the previous mixed model.

head(d)
  subject time         y1
1       1    0 -1.5801417
2       1    1 -0.1231067
3       1    2 -0.3397778
4       1    3  1.4511151
5       2    0  1.4904810
6       2    1 -0.5164371
library(tidyr)
dWide = d %>%  
  spread(time, y1)

# change the names, as usually things don't work well if they start with a number
colnames(dWide)[-1] = paste0('y', 0:3)

model = "
    # intercept and slope with fixed coefficients
    i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
    s =~ 0*y0 + 1*y1 + 2*y2 + 3*y3
"

growthCurveModel = growth(model, data=dWide)
summary(growthCurveModel)
lavaan (0.5-20) converged normally after  42 iterations

  Number of observations                           500

  Estimator                                         ML
  Minimum Function Test Statistic               10.616
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.060

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  i =~                                                
    y0                1.000                           
    y1                1.000                           
    y2                1.000                           
    y3                1.000                           
  s =~                                                
    y0                0.000                           
    y1                1.000                           
    y2                2.000                           
    y3                3.000                           

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  i ~~                                                
    s                 0.226    0.050    4.512    0.000

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y0                0.000                           
    y1                0.000                           
    y2                0.000                           
    y3                0.000                           
    i                 0.487    0.048   10.072    0.000
    s                 0.267    0.045    5.884    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y0                0.287    0.041    6.924    0.000
    y1                0.219    0.021   10.501    0.000
    y2                0.185    0.027    6.748    0.000
    y3                0.357    0.065    5.485    0.000
    i                 0.977    0.076   12.882    0.000
    s                 0.969    0.065   14.841    0.000

Note that lavaan has a specific function, growth, to use for these models. It doesn’t spare us any effort for the model syntax, but does make it unnecessary to set various things with the sem function. Most of the output is blank, which is needless clutter, but we do get the same five parameter values we are interested in though.

Start with the ‘intercepts’:

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
                       
    i                 0.487    0.048   10.072    0.000
    s                 0.267    0.045    5.884    0.000

It might be odd to call your fixed effects ‘intercepts’, but it make sense if we are thinking of it as a multilevel model as depicted previously, where we actually broke out the random effects as a separate model. The estimates here are pretty much spot on with our mixed model estimates, which are identical to just the standard regression estimates.

fixef(mixedModel)
(Intercept)        time 
  0.4898598   0.2640034 
lm(y1 ~ time, data=d)

Call:
lm(formula = y1 ~ time, data = d)

Coefficients:
(Intercept)         time  
     0.4899       0.2640  

Now let’s look at the variance estimates. The estimation of residual variance for each y in the LGC distinguishes the two approaches, but not necessarily so. We could fix them to zero here or allow them to be estimated in the mixed model framework. Just know that’s why the results are not identical (to go along with their respective estimation approaches, which are also different by default). Again though, the variances are near one, and the correlation between the intercepts and slopes is around the .2 value.

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  i ~~                                                
    s                 0.226    0.050    4.512    0.000
Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y0                0.287    0.041    6.924    0.000
    y1                0.219    0.021   10.501    0.000
    y2                0.185    0.027    6.748    0.000
    y3                0.357    0.065    5.485    0.000
    i                 0.977    0.076   12.882    0.000
    s                 0.969    0.065   14.841    0.000
VarCorr(mixedModel)
 Groups   Name        Std.Dev. Corr 
 subject  (Intercept) 0.99994       
          time        0.99488  0.208
 Residual             0.48806       

The differences provide some insight. LGC by default assumes heterogeneous variance for each time point. Mixed models by default assume the same \(\sigma^2\) for each time point, but can allow them to be estimated separately in most modeling packages.

As an example, if we fix the variances to be equal, the models are now identical.

model = "
    # intercept and slope with fixed coefficients
    i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
    s =~ 0*y0 + 1*y1 + 2*y2 + 3*y3

    y0 ~~ resvar*y0    
    y1 ~~ resvar*y1
    y2 ~~ resvar*y2
    y3 ~~ resvar*y3
"

growthCurveModel = growth(model, data=dWide)
summary(growthCurveModel)
summary(mixedModel, corr=F)
lavaan (0.5-20) converged normally after  27 iterations

  Number of observations                           500

  Estimator                                         ML
  Minimum Function Test Statistic               17.105
  Degrees of freedom                                 8
  P-value (Chi-square)                           0.029

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  i =~                                                
    y0                1.000                           
    y1                1.000                           
    y2                1.000                           
    y3                1.000                           
  s =~                                                
    y0                0.000                           
    y1                1.000                           
    y2                2.000                           
    y3                3.000                           

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  i ~~                                                
    s                 0.207    0.050    4.170    0.000

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y0                0.000                           
    y1                0.000                           
    y2                0.000                           
    y3                0.000                           
    i                 0.490    0.048   10.151    0.000
    s                 0.264    0.046    5.802    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y0      (rsvr)    0.238    0.011   22.361    0.000
    y1      (rsvr)    0.238    0.011   22.361    0.000
    y2      (rsvr)    0.238    0.011   22.361    0.000
    y3      (rsvr)    0.238    0.011   22.361    0.000
    i                 0.998    0.074   13.478    0.000
    s                 0.988    0.066   15.076    0.000

Linear mixed model fit by REML ['lmerMod']
Formula: y1 ~ time + (1 + time | subject)
   Data: d

REML criterion at convergence: 5833.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.36211 -0.48276  0.02046  0.47515  2.84524 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 subject  (Intercept) 0.9999   0.9999       
          time        0.9898   0.9949   0.21
 Residual             0.2382   0.4881       
Number of obs: 2000, groups:  subject, 500

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.48986    0.04830  10.141
time         0.26400    0.04555   5.796

In addition, the random coefficients estimates from the mixed model perfectly correlate with those of the latent variables.

ranefLatent = data.frame(coef(mixedModel)[[1]], lavPredict(growthCurveModel))
ranefLatent %>% round(2) %>% head
  X.Intercept.  time     i     s
1        -1.12  0.72 -1.12  0.72
2         0.90 -0.61  0.90 -0.61
3         1.08  1.72  1.08  1.72
4        -1.86 -0.68 -1.86 -0.68
5         0.06  1.94  0.06  1.94
6         0.87  0.24  0.87  0.24
ranefLatent %>% cor %>% round(2)
             X.Intercept. time    i    s
X.Intercept.         1.00 0.29 1.00 0.29
time                 0.29 1.00 0.29 1.00
i                    1.00 0.29 1.00 0.29
s                    0.29 1.00 0.29 1.00

Both approaches allow those residuals to covary, though it gets tedious in SEM syntax, while it is a natural extension in the mixed model framework. Here is the syntax for letting each time point covary with the next.

model = "
    # intercept and slope with fixed coefficients
    i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
    s =~ 0*y0 + 1*y1 + 2*y2 + 3*y3

    # all of the following is needed for what are essentially only two parameters 
    # to estimate- resvar and correlation (the latter defined explicitly here)
    y0 ~~ resvar*y0
    y1 ~~ resvar*y1
    y2 ~~ resvar*y2
    y3 ~~ resvar*y3

    # timepoints 1 step apart; technically the covariance is e.g. a*sqrt(y0)*sqrt(y1), 
    # but since the variances are constrained to be equal, we don't have to be so verbose.
    y0 ~~ a*y1
    y1 ~~ a*y2
    y2 ~~ a*y3
    
    # two steps apart
    y0 ~~ b*y2
    y1 ~~ b*y3
    
    # three steps apart
    y0 ~~ c*y3
    
    # fix parameters according to ar1
    b == a^2
    c == a^3
"
lavaan (0.5-20) converged normally after 287 iterations

  Number of observations                           500

  Estimator                                         ML
  Minimum Function Test Statistic               14.881
  Degrees of freedom                                 7
  P-value (Chi-square)                           0.038

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
  i =~                                                                  
    y0                1.000                               0.980    0.884
    y1                1.000                               0.980    0.603
    y2                1.000                               0.980    0.399
    y3                1.000                               0.980    0.291
  s =~                                                                  
    y0                0.000                               0.000    0.000
    y1                1.000                               0.991    0.610
    y2                2.000                               1.983    0.808
    y3                3.000                               2.974    0.882

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
  y0 ~~                                                                 
    y1         (a)    0.029    0.021    1.397    0.162    0.029    0.110
  y1 ~~                                                                 
    y2         (a)    0.029    0.021    1.397    0.162    0.029    0.110
  y2 ~~                                                                 
    y3         (a)    0.029    0.021    1.397    0.162    0.029    0.110
  y0 ~~                                                                 
    y2         (b)    0.001    0.001    0.698    0.485    0.001    0.003
  y1 ~~                                                                 
    y3         (b)    0.001    0.001    0.698    0.485    0.001    0.003
  y0 ~~                                                                 
    y3         (c)    0.000    0.000    0.466    0.641    0.000    0.000
  i ~~                                                                  
    s                 0.217    0.051    4.293    0.000    0.223    0.223

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
    y0                0.000                               0.000    0.000
    y1                0.000                               0.000    0.000
    y2                0.000                               0.000    0.000
    y3                0.000                               0.000    0.000
    i                 0.492    0.048   10.195    0.000    0.502    0.502
    s                 0.263    0.046    5.765    0.000    0.265    0.265

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)   Std.lv  Std.all
    y0      (rsvr)    0.267    0.025   10.714    0.000    0.267    0.218
    y1      (rsvr)    0.267    0.025   10.714    0.000    0.267    0.101
    y2      (rsvr)    0.267    0.025   10.714    0.000    0.267    0.044
    y3      (rsvr)    0.267    0.025   10.714    0.000    0.267    0.024
    i                 0.960    0.079   12.155    0.000    1.000    1.000
    s                 0.983    0.066   14.883    0.000    1.000    1.000

Constraints:
                                               |Slack|
    b - (a^2)                                    0.000
    c - (a^3)                                    0.000
Linear mixed-effects model fit by maximum likelihood
 Data: d 
       AIC      BIC    logLik
  5836.589 5875.795 -2911.295

Random effects:
 Formula: ~1 + time | subject
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.9768961 (Intr)
time        0.9907609 0.225 
Residual    0.5215351       

Correlation Structure: AR(1)
 Formula: ~time | subject 
 Parameter estimate(s):
      Phi 
0.1235872 
Fixed effects: y1 ~ time 
                Value  Std.Error   DF   t-value p-value
(Intercept) 0.4918555 0.04827394 1499 10.188841       0
time        0.2628092 0.04559937 1499  5.763439       0
 Correlation: 
     (Intr)
time 0.121 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.29559946 -0.45932855  0.01075219  0.45334542  2.76051403 

Number of Observations: 2000
Number of Groups: 500 

Thinking more generally about regression

In fact, your standard regression is already equipped to handle heterogeneous variances and a specific correlation structure for the residuals. In reality the linear model is the following model:

\[y \sim N(X\beta, \Sigma)\]

\(X\beta\) represents the linear predictor, i.e. the linear combination of your predictors, and a big, N by N covariance matrix \(\Sigma\). Thus the target variable \(y\) is multivariate normal with mean vector \(X\beta\) and covariance \(\Sigma\).

SLiMs assume that the covariance matrix is constant diagonal. A single value on the diagonal, \(\sigma^2\), and zeros on the off-diagonals. Mixed models, however, can allow the covariance structure to be specified in myriad ways, and it ties them to still other models, which in the end produces a very flexible modeling framework.

More on LGC

LGC are non-standard SEM

In no other SEM situation are you likely to fix so many parameters or think about your latent variables in this manner. This can make for difficult interpretations relative to the mixed model (unless you are aware of the parallels).

Residual correlations

Typical models that would be investigated via the LGC have correlated residuals as depicted above.

Nonlinear time effect

A nonlinear time effect can be estimated if we don’t fix all the parameters for the slope factor. As an example, the following would actually estimate the loadings for times in between the first and last point.

    s =~ 0*y0 + y1 + y2 + 1*y3

It may be difficult to assess nonlinear relationships unless one has many time points16, and even then, one might get more with an additive mixed model approach.

Growth Mixture Models

Adding a latent categorical variable would allow for different trajectories across the latent groups. Most clients that I’ve seen typically did not have enough data to support it, as one essentially can be estimating a whole growth model for each group. Some might restrict certain parameters for certain groups, but given that the classes are a latent construct to be discovered, there would not be a theoretical justification to do so, and it would only complicate interpretation at best. Researchers rarely if ever predict test data, nor provide evidence that the clusters hold up with alternate data. In addition, it seems that typical interpretation of the classes takes on an ordered structure (e.g. low, medium, and high), which means they just have a coarsely measured continuous latent variable. Had they started under the assumption of a continuous latent variable, it might have made things easier to interpret and estimate.

As of this writing, Mplus is perhaps the only SEM software used for this, and it requires yet another syntax style, and, depending on the model you run, some of the most confusing output you’ll ever see in SEM. Alternatives in R include flexmix (demonstrated in the Mixture Models Module) for standard mixture modeling (including mixed effects models), as well as the R package OpenMx.

Other covariates

Cluster level

To add a cluster level covariate, for a mixed model, it looks something like this:

standard random intercept \[y = b0_c + b1*time + e \] \[b0_c = b0 + u_c\]

Plugging in becomes: \[y = b0 + b1*time + u_c + e \]

subject level covariate added \[b0_c = b0 + sex + u_c\]

But if we plug that into our level 1 model, it just becomes: \[y = b0 + sex + b1*time + u_c + e\]

In our previous modeling syntax it would look like this:

mixedModel = lmer(y1 ~ sex + time + (time|subject), data=d)

We’d have a fixed effect for sex and interpret it just like in the standard setting. Similarly, if we had a time-varying covariate, say socioeconomic status, it’d look like the following:

mixedModel = lmer(y1 ~ time + ses + (time|subject), data=d)

Though we could have a random slope for SES if we wanted. You get the picture. Most of the model is still standard regression interpretation.

With LGC, there is a tendency to interpret the model as an SEM, and certainly one can. But adding additional covariates typically causes confusion for those not familiar with mixed models. We literally do have to regress the intercept and slope latent variables on cluster level covariates as follows.

model.syntax <- '
  # intercept and slope with fixed coefficients
    i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
    s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4

  # regressions
    i ~ x1 + x2
    s ~ x1 + x2
'

Applied researchers commonly have difficulty on interpreting the model due to past experience with SEM. While these are latent variables, they aren’t just latent variables or underlying constructs. It doesn’t help that the output can be confusing, because now one has an ‘intercept for your intercepts’ and an ‘intercept for your slopes’. In the multilevel context it makes sense, but there you know ‘intercept’ is just ‘fixed effect’.

Time-varying covariates

With time varying covariates, the syntax starts to get tedious.

model.syntax <- '
  # intercept and slope with fixed coefficients
    i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
    s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4

  # regressions
    i ~ x1 + x2
    s ~ x1 + x2

  # time-varying covariates
    y1 ~ c1
    y2 ~ c2
    y3 ~ c3
    y4 ~ c4
'
fit <- growth(model.syntax, data=Demo.growth)
summary(fit)

Now imagine having just a few of those kinds of variables. In the mixed model framework one would add them in as any covariate in a regression model. In the LGC framework, one has to regress each time point for the target variable on its corresponding predictor time point. It might take a few paragraphs to explain the coefficients for just a handful of covariates.

Some Differences between Mixed Models and Growth Curves

Random slopes

One difference seen in comparing LGC models vs. mixed models is that in the former, random slopes are always assumed, whereas in the latter, one would typically see if it’s worth adding random slopes in the first place, or simply not assume them.

Wide vs. long

The SEM framework is inherently multivariate, and your data will need to be in wide format. This isn’t too big of a deal until you have many time-varying covariates, then the model syntax is tedious and you end up having the number of parameters to estimate climb rapidly. God help you if you want to investigate interactions based on time-varying covariates.

Sample size

As we have noted before, SEM is inherently a large sample technique. The mixed model does not require as much for standard approaches, but may require a lot more depending on the model one tries to estimate.

Number of time points

Mixed models can run even if some clusters have a single value. SEM requires balanced data and so one will always have to estimate missing values or drop them. Whether this missingness can be ignored in the standard mixed model framework is a matter of some debate in certain circles.

Time points

Numbering your time from zero makes sense in both worlds. This leads to the natural interpretation that the intercept is the mean for your first timepoint.

Other stuff

Here is an example of a parallel process in which we posit two growth curves at the same time, with possible correlations among them. This could be accomplished quite easily with a standard mixed model in the Bayesian framework, with a multivariate response, though I’ll have to come back to that later.

# parallel process --------------------------------------------------------

# let's simulate data with a negative slope average and positive correlation among intercepts and other process slopes
set.seed(1234)
n = 500
timepoints = 4
time = rep(0:3, times=n)
subject = rep(1:n, each=4)

# first we'll draw intercepts with overall mean .5 and -.5 for the two
# processes, and let them have a slight correlation. Their variance is 1.
intCorr = matrix(c(1,.2,.2,1), ncol=2) 
colnames(intCorr) = rownames(intCorr) = c('i1', 'i2')
intCorr

interceptP1 = .5
interceptP2 = -.5

ranInts = MASS::mvrnorm(n, mu=c(0,0), Sigma = intCorr, empirical=T)
ranInts = data.frame(ranInts)
head(ranInts)
cor(ranInts)
colMeans(ranInts)

# now create slopes with intercept/mean .4, -.4, but the same positive
# relationship with their respective intercept. Their variances are also 1.
slopeP1 = .4
slopeP2 = -.4

s1 = .3*ranInts$i2  + rnorm(n)
s2 = .3*ranInts$i1  + rnorm(n)

ranef = data.frame(ranInts, s1, s2)
head(ranef)


# so we have slight positive correlations among all random intercepts and slopes
y1 = (interceptP1 + ranef$i1[subject]) + (slopeP1+ranef$s1[subject])*time + rnorm(n*timepoints, sd=.5)
d1 = data.frame(Subject=subject, time=time, y1)
head(d1)

library(ggplot2)
ggplot(aes(x=time, y=y1), data=d1) + 
  geom_line(aes(group=Subject), alpha=.1) + 
  geom_smooth(method='lm',color='red') +
  lazerhawk::theme_trueMinimal()


y2 = (interceptP2 + ranef$i2[subject]) + (slopeP2+ranef$s2[subject])*time + rnorm(n*timepoints, sd=.5)
d2 = data.frame(Subject=subject, time=time, y2)

# process 2 shows the downward overall trend as expected
ggplot(aes(x=time, y=y2), data=d2) + 
  geom_line(aes(group=Subject), alpha=.1) + 
  geom_smooth(method='lm',color='red') +
  lazerhawk::theme_trueMinimal()

# Widen from long form for lavaan
library(tidyr)
negslopepospath1 = d1 %>% spread(time, y1)
colnames(negslopepospath1) = c('Subject', paste0('y1', 1:4))
head(negslopepospath1)

negslopepospath2 = d2 %>% spread(time, y2)
colnames(negslopepospath2) = c('Subject', paste0('y2', 1:4))

# combine
dparallel = dplyr::left_join(negslopepospath1, negslopepospath2)
head(dparallel)

mainModel = "
i1 =~ 1*y11 + 1*y12 + 1*y13 + 1*y14
s1 =~ 0*y11 + 1*y12 + 2*y13 + 3*y14


i2 =~ 1*y21 + 1*y22 + 1*y23 + 1*y24
s2 =~ 0*y21 + 1*y22 + 2*y23 + 3*y24

s1 ~ i2
s2 ~ i1
"

library(lavaan)
mainRes  = growth(mainModel, data=dparallel)
summary(mainRes)
fscores = lavPredict(mainRes)
broom::tidy(data.frame(fscores))
lm(s2~., fscores)

lazerhawk::corrheat(cor(fscores))
qplot(s1, i2, data=data.frame(fscores)) + geom_smooth(method='lm', se=F)
fv = lavPredict(mainRes, 'ov')
summary(mainRes, stdv)
d3heatmap::d3heatmap(cor(fv, fscores))
d3heatmap::d3heatmap(cor(select(dparallel, -Subject), ranef), Rowv = F, Colv = F)


process1Model = "
i1 =~ 1*y11 + 1*y12 + 1*y13 + 1*y14
s1 =~ 0*y11 + 1*y12 + 2*y13 + 3*y14
"
p1Res = growth(process1Model, data=dparallel)
fscoresP1 = lavPredict(p1Res)

process2Model = "
i2 =~ 1*y21 + 1*y22 + 1*y23 + 1*y24
s2 =~ 0*y21 + 1*y22 + 2*y23 + 3*y24
"
p2Res = growth(process2Model, data=dparallel)
fscoresP2 = lavPredict(p2Res)

fscoresSeparate = data.frame(fscoresP1, fscoresP2)

pathMod = "
s1 ~ i2
s2 ~ i1

i1~~i2
"

pathModRes = sem(pathMod, data=fscoresSeparate, fixed.x = F)
summary(pathModRes)  # you'd have come to the same conclusions
summary(mainRes)

Summary

Growth curve modeling is an alternate way to do what is very commonly accomplished through mixed models, and allow for more complex models than typically seen for standard mixed models. One’s default should probably be to use the more common, and probably more flexible (in most situations), mixed modeling tools. However, the latent variable approach may provide what you need, and at the very least gives you a fresh take on the standard mixed model perspective.

Mixture Models

Thus far we have understood latent variables as possessing an underlying continuum, i.e. normally distributed with a mean of zero and some variance. This does not have to be the case, and instead we can posit a categorical variable. Some approaches you may in fact be already familiar with, as any modeling process under the heading of ‘cluster analysis’ could be said to deal with latent categorical variables. The issue is that we may feel that there is some underlying structure to the data that is described as discrete, and based on perhaps multiple variables.

We will approach this in the way that has been done from statistical and related motivations, rather than the SEM/psychometric approach. This will make clearer what it is we’re dealing with as well as not get bogged down in terminology. Furthermore, mixture models are typically poorly implemented within SEM, as many of the typical issues can often be magnified. The goal here as before is clarity of thought over ‘being able to run it’.

A common question in such analysis is how many clusters? There are many, many techniques for answering this question, and not a single one of them even remotely definitive. On the plus side, the good news is that we already know the answer, because the answer is always 1. However, that won’t stop us from trying to discover more than that, so here we go.

A Motivating Example

Take a look at the following data. It regards the waiting time between eruptions and the duration of the eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

library(ggplot2)
data("faithful")

Take a good look. This is probably the cleanest separation of clustered data you will likely ever see, and even so there are still data points that might fall into either cluster.

Create Clustered Data

To get a sense of mixture models, let’s actually create some data that might look like the Old Faithful data above. In the following we create something that will look like the eruptions variable in the faithful data. To do so, we draw one random sample from a normal distribution with a mean of 2, and the other with a mean of 4.5, and both get a standard deviation of .25. The first plot is based on the code below, the second on the actual data.

library(ggplot2)
set.seed(1234)
erupt1 = rnorm(150, mean=2, sd=.25)
erupt2 = rnorm(150, mean=4.5, sd=.25)
erupt = sample(c(erupt1, erupt2))

What do we have and what do we see? The data is a mixture of two normals, but we can think of the observations as belonging to a latent class, and each class has its own mean and standard deviation (and is based on a normal, but doesn’t have to be). Each observation has some likelihood, however great or small, of coming from either cluster, and had we really wanted to do more appropriate simulation, we would incorporate that information.

A basic approach for categorical latent variable analysis from a model based perspective17:

  1. Posit the number of clusters you believe there to be
  2. For each observation, estimate those probability of coming from either cluster
  3. Assign observations to the most likely class (i.e. the one with the highest probability).

More advanced approaches might include:

  • Predicting the latent classes in a manner akin to logistic regression
  • Allow your model coefficients to vary based on cluster membership
    • For example, have separate regression models for each class

Mixture modeling with Old Faithful

The following uses the flexmix package and function to estimate a regression model per class. In this case, our model includes only an intercept, and so is equivalent to estimating the mean and variance of each group. We posit k=2 groups.

We can see from the summary about 2/3 are classified to one group. We also get the estimated means and standard deviations for each group. Note that the group labels are completely arbitrary.

library(flexmix)
mod = flexmix(eruptions~1,  data=faithful, k = 2)
summary(mod)

Call:
flexmix(formula = eruptions ~ 1, data = faithful, k = 2)

       prior size post>0 ratio
Comp.1 0.652  177    190 0.932
Comp.2 0.348   95     98 0.969

'log Lik.' -276.3611 (df=5)
AIC: 562.7222   BIC: 580.7512 
head(mod@posterior$scaled, 10) %>% round(4)  # show some estimated probabilities
        [,1]   [,2]
 [1,] 1.0000 0.0000
 [2,] 0.0000 1.0000
 [3,] 1.0000 0.0000
 [4,] 0.0001 0.9999
 [5,] 1.0000 0.0000
 [6,] 0.8384 0.1616
 [7,] 1.0000 0.0000
 [8,] 1.0000 0.0000
 [9,] 0.0000 1.0000
[10,] 1.0000 0.0000
parameters(mod)    #means and std dev for each group
                    Comp.1    Comp.2
coef.(Intercept) 4.2735128 2.0187913
sigma            0.4376169 0.2363539

The first plot shows the estimated probabilities for each observation for one of the clusters (with some jitter), which in this case are all around 0 or 1. Again, you will probably never see anything like this, but clarity is useful here. The second plot shows the original data with their classification and contours of the density for each group.

SEM and Latent Categorical Variables

Dealing with categorical latent variables can be somewhat problematic. Interpreting one SEM model might be difficult enough, but then one might be allowing parts of it to change depending on which latent class observations belong to, while having to assess the latent class measurement model as well. It can be difficult to find a clarity of understanding from this process, as one is discovering classes then immediately assuming key differences among these classes they didn’t know existed beforehand18. In addition, one will need even more data than standard SEM to deal with all the additional parameters that are allowed to vary across the classes.

Researchers also tend to find classes that represent ‘hi-lo’ or ‘hi-med-lo’ groups, which may suggest they should have left the latent construct as a continuous variable. When given the choice to discretize continuous variables in normal settings, it is rare in which the answer is anything but a resounding no. As such, one should think hard about the ultimate goals of such a model.

Terminology in SEM

Latent Class Analysis refers to dealing with categorical latent variables in the context of multivariate categorical data. For example one might have a series of yes/no questions on a survey, and want to discover categories of responses.

Latent Profile Analysis refers to dealing with categorical latent variables in the context of multivariate numerical data.

Latent Categories vs. Multigroup Analysis

The primary difference between the two is that one grouping structure actually exists in the data, for example, sex, race etc. In that case, a multigroup analysis would allow for separate SEM models per group. In the latent categorical variable situation, one must first discover the latent groups. In multigroup analysis, a common goal is to test measurement invariance, a concept which has several definitions itself. An example would be to see if the latent structure holds for an American vs. foreign sample, with the same items for the scale provided in the respective languages. This makes a lot of sense from a measurement model perspective and has some obvious use.

If one wants to see a similar situation for a less practically driven model, e.g. to see if an SEM model is the same for males vs. females, this is equivalent to having an interaction with the sex variable for every path in the model. The same holds for ‘subgroup analysis’ in typical settings, where you won’t find any more than you would by including the interactions of interest with the whole sample, though you will certainly have less data to work with. In any case, we need a lot of data to compare separate models where parameters are allowed to vary by group vs. a model in which they are fixed, and many simply do not have enough data for this.

Latent Trajectories

As noted in the growth curve modeling section, these are growth curve models in which intercepts and slopes are allowed to vary across latent groups as well as the clusters. The flexmix package used previously would allow one to estimate such models from the mixed model perspective, and might be preferred.

Estimation

If you would like to see the conceptual innards of estimating mixture models using EM Algorithm, see this link on my github page.

R packages used

  • flexmix

Other

Other topics that may be covered in varying detail in the future. With packages noted.

  • IRT
    • ltm or lavaan
  • Multiple group analysis
    • lavaan
  • Multilevel SEM
  • Collaborative filtering
    • recommenderlab
  • HMM, Linear Dynamical Systems
    • HMM, HiddenMarkov depmixS4
  • Others

Excercises

Path Analysis

Part 1

This exercise investigates satisfaction in high school seniors using the 1993 Monitoring the Future dataset ('data/monitorfuture.csv'). Perform a path analysis on the four measured variables: self-esteem (esteem), overall life satisfaction (overall), locus of control (locus), and loneliness (lonely). Estimate the following model using lavaan (note that the dashed line is an ‘unanalyzed’ correlation and doesn’t need to be explicit in the model).


First get the data in, then set up your model code. The following provides a hint.

mtf = read.csv('filelocation/name.csv')
satistifaction_ModelCode = 'Y ~ X + z'

After that, run the model using the sem function in lavaan. The following provides a hint.

satistifaction_ModelResults = sem(modelcodename, data=modeldataname)
summary(satistifaction_ModelResults, rsquare=T)

Your results should look conform to the following.


Questions:

  1. What are the positive effects in this model?

  2. What are the negative effects in this model?

  3. What is your initial impression about model performance?

  4. What specifically are the R2 values for the endogenous variables?

Part 2

Rerun with the following model (click to show).


satistifaction_ModelCode_Mediation = '
  overall ~ c*esteem + lonely
  esteem ~ a*locus + b*lonely

  # Indirect effects
  locusOverall := a*c
  lonelyOverall := b*c
'


Regarding this mediation model…

  • A. Is the indirect effect for loneliness on life satisfaction statistically significant?
  • B. How about locus of control?
  • C. Do you think loneliness causes self-esteem (more lonely, less self-esteem), or vice versa, or perhaps they are simply correlated?




Factor Analysis

Part 1

Data: National Longitudinal Survey of Youth (1997, NLSY97), which investigates the transition from youth to adulthood. For this example, a series of questions asked to the participants in 2006 pertaining to the government’s role in promoting well-being will be investigated. Questions regarded the government’s responsibility for following issues: provide jobs for everyone, keep prices under control, provide health care, provide for elderly, help industry, provide for unemployed, reduce income differences, provide college financial aid, provide decent housing, protect environment. They have four values 1:4, which range from ‘definitely should be’ to ‘definitely should not be’.

  1. Run a factor analysis using the 'data/nlsy97_governmentNumeric.csv' dataset. Your first model will have all items (sans ID) loading on a single factor. With lavaan, use the cfa function. Recall that you give it the model code (as a string) and a data = argument like cfa(mod, data=mydata).

Try it on your own or click to see the general format.

govtInvolvement = read.csv('data/nlsy97_governmentNumeric.csv')
govtInvolvement_ModelCode = ''
govtInvolvement_Results = cfa(govtInvolvement_ModelCode, sample.cov=marshCov, sample.nobs=385)
summary(govtInvolvement_Results, standardized=T, fit=T)
  1. Note the standardized loadings, are the items ‘hanging together’ well? Do you see any that might be we somewhat weak?

Part 2

  1. Now specify a two factor structure of your choosing. As an example, some of these are maybe more economic in nature, while others might fit in some other category. Whatever makes sense to you, or just randomly split it.

  2. Does this seem any more interpretable? Were the latent variables notably correlated? Which model would you say is better based on internal performance, in terms of comparison (e.g. lower AIC = preferred model), and in terms of interpretability?

Click to show example (after you’ve tried yourself).

library(lavaan)
govtInvolvement = read.csv('data/nlsy97_governmentNumeric.csv')

govtInvolvement_ModelCode1 = "
  moregovt =~ ProvideJobs + Prices + HealthCare + Elderly + Industry + Unemployed + IncInequal + College + Housing + Environment
"
model1 = cfa(govtInvolvement_ModelCode1, data=govtInvolvement) 
summary(model1, fit=T, standardized=T)

govtInvolvement_ModelCode2 = '
  econ =~ ProvideJobs + Industry + Unemployed + IncInequal + Housing + Prices
  services =~ HealthCare + Elderly + College + Environment
'
model2 = cfa(govtInvolvement_ModelCode2, data=govtInvolvement) 
summary(model2, fit=T, standardized=T)

# Compare the two († means the better result)
semTools::compareFit(model1, model2, nested=F)




SEM

You get the choice of doing exercise 1 or 2.

Exercise 1

The data for this exercise comes from a paper published by Marsh and Hocevar in 1985. The data regards information on 385 fourth and fifth grade students who filled out a ‘Self-Description Questionnaire.’ The questionnaire has 16 items, four of which measure physical ability, four measure physical appearance, four measure relations with peers and the final four measure relations with parents. The data is saved in the file ‘Marsh85.dta’ as summary data with means, standard deviations and correlations. However you will just use ‘Marsh85_SEMExercise.csv’, which is the covariance matrix.

We are interested in determining how a student’s physical appearance and physical ability might predict relationships with their peers. The diagram for the model of interest is shown below.







The following will get you started. All you need to do is fill in the model code. Rather than raw data, we will be using the sample covariance matrix with sample size equal to 385. Note how you can detect the structure visually. Physical ability hangs together well and is positively correlated with the other factors, which are more strongly correlated with each other. This is extremely important in factor analysis and SEM, as they deal with the correlation matrix, you should often be able to see the structure before modeling even begins.







marsh85 = read.csv('data/Marsh85_SEMExercise.csv', row.names = 1)
marsh85 = as.matrix(marsh85)
peerRel_ModelCode = ''
peerRel_Results = sem(peerRel_ModelCode, sample.cov=marsh85, sample.nobs=385)
summary(peerRel_Results, standardized=T, fit=T, rsquare=T)



Write a brief summary in terms of an assessment of the measurement components of the model, overall impression of model fit, and specifics of the structural relations (i.e. the paths among the latent variables) and model performance in terms of R2.

Exercise 2

In this second example, we will use the classic Political Democracy dataset used by Bollen in his seminal 1989 book on structural equation modeling. This data set includes four measures of democracy at two points in time, 1960 and 1965, and three measures of industrialization in 1960, for 75 developing countries.

  • FoPress60: freedom of the press, 1960
  • FoPopp60: freedom of political opposition, 1960
  • FairElect60: fairness of elections, 1960
  • EffectiveLeg60: effectiveness of elected legislature, 1960
  • FoPress65: freedom of the press, 1965
  • FoPopp65: freedom of political opposition, 1965
  • FairElect65: fairness of elections, 1965
  • EffectiveLeg65: effectiveness of elected legislature, 1965
  • GNP60: GNP per capita, 1960
  • EnConsump60: energy consumption per capita, 1960
  • PercLaborForce60: percentage of labor force in industry, 1960

The model we wish to estimate is in according to the following diagram.








Here is some starter code. All you need to do is fill in the model code.



poldem = read.csv('data/PoliticalDemocracy.csv')
poldem_ModelCode = ''
poldem_Results = sem(poldem_ModelCode, data=poldem)
summary(poldem_Results, standardized=T, fit=T, rsquare=T)



Write a brief summary in terms of an assessment of the measurement components of the model, overall impression of model fit, and specifics of the structural relations (i.e. the paths among the latent variables) and model performance in terms of R2.

Appendix

Data set descriptions

McClelland

Description

  • McClelland et al. (2013) abstract

This study examined relations between children’s attention span-persistence in preschool and later school achievement and college completion. Children were drawn from the Colorado Adoption Project using adopted and non-adopted children (N = 430). Results of structural equation modeling indicated that children’s age 4 attention span-persistence significantly predicted math and reading achievement at age 21 after controlling for achievement levels at age 7, adopted status, child vocabulary skills, gender, and maternal education level. Relations between attention span-persistence and later achievement were not fully mediated by age 7 achievement levels. Logistic regressions also revealed that age 4 attention span-persistence skills significantly predicted the odds of completing college by age 25. The majority of this relationship was direct and was not significantly mediated by math or reading skills at age 7 or age 21. Specifically, children who were rated one standard deviation higher on attention span-persistence at age 4 had 48.7% greater odds of completing college by age 25. Discussion focuses on the importance of children’s early attention span-persistence for later school achievement and educational attainment.

Reference

McClelland, Acock, Piccinin, Rheac, Stallings. (2013). Relations between preschool attention span-persistence and age 25 educational outcomes. link

National Longitudinal Survey of Youth (1997, NLSY97)

Description

NLSY97 consists of a nationally representative sample of approximately 9,000 youths who were 12 to 16 years old as of December 31, 1996. Round 1 of the survey took place in 1997. In that round, both the eligible youth and one of that youth’s parents received hour-long personal interviews. In addition, during the screening process, an extensive two-part questionnaire was administered that listed and gathered demographic information on members of the youth’s household and on his or her immediate family members living elsewhere. Youths are interviewed on an annual basis.

The NLSY97 is designed to document the transition from school to work and into adulthood. It collects extensive information about youths’ labor market behavior and educational experiences over time. Employment information focuses on two types of jobs, “employee” jobs where youths work for a particular employer, and “freelance” jobs such as lawn mowing and babysitting. These distinctions will enable researchers to study effects of very early employment among youths. Employment data include start and stop dates of jobs, occupation, industry, hours, earnings, job search, and benefits. Measures of work experience, tenure with an employer, and employer transitions can also be obtained. Educational data include youths’ schooling history, performance on standardized tests, course of study, the timing and types of degrees, and a detailed account of progression through post-secondary schooling.

Wheaton 1977 data

Description

Longitudinal data to develop a model of the stability of alienation from 1967 to 1971, accounting for socioeconomic status as a covariate. Each of the three factors have two indicator variables, SES in 1966 is measured by education and occupational status in 1966 and alienation in both years is measured by powerlessness and anomia.

Reference

Wheaton, B., Muthen B., Alwin, D., & Summers, G., 1977, “Assessing reliability and stability in panel models”, in D. R. Heise (Ed.), Sociological Methodology 1977 (pp. 84-136), San Francisco: Jossey-Bass, Inc.

Old Faithful

From the R helpfile

Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement. For a better version of the eruption times, see the example below.

There are many versions of this dataset around: Azzalini and Bowman (1990) use a more complete version.

Harman 1974

Description

A correlation matrix of 24 psychological tests given to 145 seventh and eight-grade children in a Chicago suburb by Holzinger and Swineford.

Reference

Harman, H. H. (1976) Modern Factor Analysis, Third Edition Revised, University of Chicago Press, Table 7.4.

Terminology in SEM

SEM as it is known has been widely used in psychology and education for decades, while other disciplines have developed and advanced techniques that are related, but would not typically call them SEM. The following will be expanded over time.

Latent variable: an unobserved or hidden variable. It’s specific interpretation will depend on the modeling context.

Factor Analysis: in the SEM literature, this refers to a latent variable (measurement) model to assess the underlying construct behind the correlations among a set of observed variables. Elsewhere it may refer to a very broad family of matrix factorization techniques that would include things like principal components analysis, non-negative matrix factorization, etc.

Exploratory vs. Confirmatory: This distinction is problematic. Science and data analysis is inherently exploratory, and most who use SEM do some model exploration as they would with any other model. Some SEM models have more constraints than others, but that does not require a separate name or way of thinking about the model.

Mediation and moderation: These mean different things, both straightforward, and which is utilized in a model should be based on theoretical notions.

  • Mediation: an indirect effect, e.g. A->B->C, A has an indirect effect on C. A can have a direct effect on C too.
  • Moderation: an interaction (the same ones utilized in a standard regression modeling)

Fit: Model fit is something very difficult to ascertain in SEM, and notoriously problematic in this setting, where all proposed cutoffs for a good fit are ultimately arbitrary. Even if one had most fit indices suggesting a ‘good’ fit, there’s little indication the model has predictive capability.

Endo/Exogenous: Endogenous variables are determined by other variables, exogenous variables have no analyzed causes.

Disturbance: residual variance. Includes measurement error and unknown causes.

Mixture Models: models using categorical latent variables.

Resources

This list serves only as a starting point, though may be added to over time.

Graphical Models

UseR Series: Contains texts on graphical models, Bayesian networks, and network analysis.

Measurement Models

Personality Project: William Revelle’s website and text on psychometric theory.

SEM

Kline, Rex. Principles and Practice of Structural Equation Modeling. A very applied introduction.

Beaujean, A. A. (2014). Latent variable modeling using R: A step by step guide. New York, NY: Routledge/Taylor and Francis. Lavaan based guide to SEM

Other SEM

References

Kline, Rex B. 2015. Principles and Practice of Structural Equation Modeling. Guilford publications.

Pearl, Judea. 2009. Causality. Cambridge university press.


  1. There are other assumptions at work also, e.g. that the model is correct and there are no other confounders.

  2. Sewall Wright first used path analysis almost 100 years ago.

  3. I use multivariate here to refer to multiple dependent variables, consistent with multivariate analysis generally. Some use it to mean multiple predictors, but since you’re not going to see single predictor regression outside of an introductory statistical text, there is no reason to distinguish it. Same goes for multiple regression.

  4. In the past people would run separate OLS regressions to estimate mediation models, particularly for the most simple, three variable case. One paper that for whatever reason will not stop being cited is Baron & Kenny 1986. It was 1986. Please do not do mediation models like this. You will always have more than three variables to consider, and always have access to R or other software that can estimate the model appropriately. While I think it is very helpful to estimate your models in piecemeal fashion for debugging purposes and to facilitate your understanding, use appropriate tools for the model you wish to estimate. Some packages, such as <span class=‘pack>mediate, may still require separate models, but there is far more going on ’under the hood’ even then. For more recent work in this area, see the efforts of Pearl and Imai especially.

  5. I suspect this is likely the case for the majority of modeling scenarios.

  6. There is a statement “All results controlled for background covariates of vocabulary at age 4, gender, adoption status, and maternal education level.” and a picture of only the three-variable mediation part.

  7. No, ‘non’-recursive as a name for these models makes no sense to me either.

  8. Principal components, standard factor analysis and SEM can work on covariance/correlation matrices even without the raw data, this will be perhaps demonstrated in a later version of this doc.

  9. I don’t think it necessary to get into rotation here, though will likely add a bit in the future. If you’re doing PCA, you’re likely not really concerned about interpretation of loadings, as you are going to use the components for other means. It might help with standard factor analysis, but this workshop will spend time on more focused approaches where one would have some idea of the underlying structure rather than looking to uncover the structure. Rotation doesn’t change anything about the fundamental model result, so one just uses whatever leads to the clearest interpretation.

  10. One version of factor analysis is nearly identical to PCA in terms of mechanics, save for what are on the diagonals of the correlation matrix (1s vs. ‘communalities’).

  11. For your own sake, if you develop a questionnaire, make higher numeric values correspond to meaning ‘more of’ something, rather than in this backward fashion.

  12. Note that this is actually done for all disturbance/residual terms, as there is an underlying latent variable there which represents measurement error and the effect of unexplained causes. The path of that latent variable is fixed to 1, and its variance is the residual variance in the SEM output.

  13. See the blavaan package.

  14. Despite the name, there is nothing inherently ‘statistical’ about the DSM.

  15. One can set REML=F so as to use standard maximum likelihood and make the results directly comparable to lavaan.

  16. I personally cannot see bends with only four time points, at such that I couldn’t just as easily posit a linear trend.

  17. Note that k-means and other common cluster analysis techniques are not model based as in this example. In model-based approaches we assume some probabilistic data generating process (e.g. normal distribution) rather than some heuristic.

  18. By contrast, continuous latent variables are typically clearly theoretically motivated in typical SEM practice. Data set sizes would usually limit the number of groups to three at most, yet there is no reason to believe there would only be three distinct latent classes in any reasonable amount of data.